Thread: Does anyone know how to read a Word document in .Net 2003?




Does anyone know how to read a Word document in .Net 2003?
user name
2006-12-12 05:11:32
> Are there any issues if I just rename the Word doc from file.doc to
> file.txt, then open the file as a text document and parse it for the
> data I need?  I know that the Word document format is not in straight
> ASCII text, but it appears that the data itself is.

That is TOTALLY wrong, no offense... the Word document format is actually
a structured-storage document composed of a tree of elements, and each
element is a list of text snippets (some used, some old noise) in a
nonlinear linked list. If you simply do a "strings" on the file, you'll end
up with a lot of unrelated text in apparently random order.  Some of that
text can even be from another unrelated document, or prior versions of the
document (or template it was horked from).
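
As a rough illustration of the point above, here is a minimal sketch in
Python (chosen over .Net purely for brevity). It assumes the file is a
binary .doc stored as an OLE structured-storage (compound) file, which
begins with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The helper names
and the sample bytes are made up for the example; the sample is not a real
Word file.

```python
import re

# Signature that opens every OLE structured-storage (compound) file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_structured_storage(data: bytes) -> bool:
    """Return True if the bytes start with the compound-file signature."""
    return data[:8] == OLE_MAGIC


def naive_strings(data: bytes, min_len: int = 4) -> list:
    """Mimic the Unix 'strings' tool: printable ASCII runs, in file order.

    File order is not document order, and nothing here can tell live
    text from stale leftovers, which is why renaming to .txt is unsafe.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


# Hypothetical bytes standing in for a .doc: signature, binary noise,
# a leftover snippet, more noise, then the text the user actually sees.
sample = (
    OLE_MAGIC
    + b"\x00\x03"
    + b"paragraph deleted long ago"
    + b"\x00\xff\x01"
    + b"current paragraph"
    + b"\x00"
)

print(is_structured_storage(sample))
print(naive_strings(sample))
```

Both snippets come back from naive_strings, and the stale one comes first,
so a text-only parser has no way to know which of them is really in the
document.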

Seriously, if you want the text from a doc file, use IFilter. If you
need a .Net version just say so.

--
"I am Dyslexic of Borg. Resistors are fertile. Prepare to have your
ass laminated." -- Dan Nitschke

Marc C. Brooks
http://musingmarc.blogspot.com

===================================
This list is hosted by DevelopMentor®  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com

Does anyone know how to read a Word document in .Net 2003?
user name
2006-12-12 08:24:00
I totally agree with you: IFilter is the right way to go. In the past
I've found this reference a good starting point:
http://www.codeproject.com/csharp/IFilter.asp

I've used it to gather text from PDF documents, but it applies to Word
documents as well. Pay attention to the fact that some IFilter components
are not reentrant (most notably Adobe's); so, in a multithreaded
environment (like ASP.Net), you will need a workaround to use them
safely. Btw, MS Word's IFilter should work well.
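
One possible workaround, assuming it is acceptable to process one document
at a time per filter, is to serialize every call behind a lock. The sketch
below is in Python (rather than .Net, for brevity) and only shows the
locking pattern: fake_extract is a hypothetical stand-in for the real
IFilter call, and extract_text_serialized is a made-up helper name.

```python
import threading
import time

# One process-wide lock per non-reentrant filter component.
_filter_lock = threading.Lock()


def extract_text_serialized(extract, path):
    """Run a non-reentrant extractor so that only one call is active."""
    with _filter_lock:
        return extract(path)


def fake_extract(path):
    """Hypothetical stand-in for a real, non-reentrant filter call."""
    time.sleep(0.01)  # pretend to do some work
    return "text of " + path


results = []
workers = [
    threading.Thread(
        target=lambda p=p: results.append(
            extract_text_serialized(fake_extract, p)
        )
    )
    for p in ("a.pdf", "b.pdf", "c.pdf")
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))
```

The trade-off is throughput: requests queue up behind the lock, so this
fits best when extraction is only a small part of each request.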

HTH,

Efran Cobisi
http://www.cobisi.com

Marc Brooks wrote:
>> Are there any issues if I just rename the Word doc from file.doc to
>> file.txt, then open the file as a text document and parse it for the
>> data I need?  I know that the Word document format is not in straight
>> ASCII text, but it appears that the data itself is.
>
> That is TOTALLY wrong, no offense... the Word document format is
> actually a structured-storage document composed of a tree of elements,
> and each element is a list of text snippets (some used, some old
> noise) in a nonlinear linked list. If you simply do a "strings" on the
> file, you'll end up with a lot of unrelated text in apparently random
> order.  Some of that text can even be from another unrelated document,
> or prior versions of the document (or template it was horked from).
>
> Seriously, if you want the text from a doc file, use IFilter. If you
> need a .Net version just say so.



Does anyone know how to read a Word document in .Net 2003?
user name
2006-12-12 16:34:59
I agree that the IFilter is the best way to go for what I need. But keep
in mind that I have to take a look at many options to make sure that I am
recommending the correct solution for this client.  Since the client is
driving this and they asked me to look at parsing the Word doc as text,
I certainly have to look into it.  That option does make sense when you
look at the code, and I was able to write a quick parser to pull out the
data.  So it looked like a reasonable solution on the surface.

I think the IFilter is the best option so far.  It seems easy enough to
implement and it handles everything that I need.  It's the one I am
currently working to implement.

Best regards,
Jon


Does anyone know how to read a Word document in .Net 2003?
user name
2006-12-12 21:04:59
It turns out that the IFilter is a great way to handle this and fits very
well with my need, as it easily connects to the parser.  It also works
for many of the document types that this customer wants to support in the
future, as it solves reading the data from most of the MS Office document
types.

I have one final issue that I am trying to figure out.  If this
application is installed on a server without Word, will this work?  Does
the IFilter get installed with Windows or with Word?  Can you deploy the
Word IFilter without Word?  Or can I install Word, then uninstall it and
keep the Word IFilter there?

I'm trying to figure out if I need to tell the client that they have to
purchase and install a copy of the parent application for each supported
document type... on the server.  They are also looking at OneNote, which
does have an IFilter, but that IFilter is only installed when you install
OneNote, and it seems that you have to go in and manually configure the
IFilter.  So it would seem that they have to buy a copy for the server
and do some work on their end before we could deploy this app.

Does anyone know if this is true?  Is there a way around buying separate
copies for the server or is that a must?  Can you install an IFilter
without the parent product?

Thanks,
Jon

