I totally agree with you: IFilter is the right way to go. In the past
I've found this reference a good starting point:
http://www.codeproject.com/csharp/IFilter.asp

I've used it to gather text from PDF documents, but it applies to Word
documents as well. Be aware that some IFilter components are not
reentrant (most notably Adobe's), so in a multithreaded environment
(like ASP.NET) you will need a workaround to make them work reliably.
By the way, MS Word's IFilter should work well.
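
One possible workaround for a non-reentrant filter is simply to serialize
access to it, so that only one thread is inside the component at a time.
Below is a minimal, language-neutral sketch of that idea in Python. The
names FakeNonReentrantFilter, SerializedExtractor and extract_all are
hypothetical and only stand in for the real IFilter wrapper; in C# the
same pattern would use a `lock` statement around the extraction call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeNonReentrantFilter:
    """Hypothetical stand-in for a non-reentrant text extractor.

    It counts how many times a call started while another call was
    still in progress, which is what a non-reentrant component
    cannot tolerate.
    """

    def __init__(self):
        self._busy = False
        self.overlaps = 0

    def extract_text(self, path):
        if self._busy:
            self.overlaps += 1  # a second caller entered concurrently
        self._busy = True
        time.sleep(0.001)       # simulate the extraction work
        self._busy = False
        return f"text of {path}"


class SerializedExtractor:
    """Wraps an extractor so that only one thread calls it at a time."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def extract_text(self, path):
        # Every caller waits here until the previous call has finished.
        with self._lock:
            return self._inner.extract_text(path)


def extract_all(extractor, paths, workers=8):
    """Extract text from many documents using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(extractor.extract_text, paths))
```

The trade-off is that extraction through that filter becomes
single-threaded, which may be acceptable if calls are short or rare.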
HTH,
Efran Cobisi
http://www.cobisi.com
Marc Brooks wrote:
>> Are there any issues if I just do a rename of the Word doc from
>> file.doc to file.txt, then open the file as a text document and
>> parse it for the data I need? I know that the Word document format
>> is not in straight ASCII text, but it appears that the data itself is.
>
> That is TOTALLY wrong, no offense... the Word document format is
> actually a structured-storage document composed of a tree of elements,
> and each element is a list of text snippets (some used, some old
> noise) in a nonlinear linked list. If you simply do a "strings" on the
> file, you'll end up with a lot of unrelated text in apparently random
> order. Some of that text can even be from another unrelated document,
> or prior versions of the document (or template it was horked from).
>
> Seriously, if you want the text from a doc file, use IFilter. If you
> need a .NET version just say so.
>
> --
> "I am Dyslexic of Borg. Resistors are fertile. Prepare to have your
> ass laminated." -- Dan Nitschke
>
> Marc C. Brooks
> http://musingmarc.blogspot.com
>
> ===================================
> This list is hosted by DevelopMentor® http://www.develop.com
>
> View archives and manage your subscription(s) at
> http://discuss.develop.com