On Aug 30, 2006, at 2:42 AM, Bruno wrote:
> browsing through the message thread I tried to find a trail
> addressing file system crawls. I want to implement an enterprise
> search over a networked filesystem, crawling all sorts of documents,
> such as html, doc, ppt and pdf.
> Nutch provides plugins enabling it to read proprietary formats.
> Is there support for the same functionality in solr?
No. Solr is strictly a search server that takes plain text for the
fields of documents added to it. The client is responsible for parsing
the text out of these types of documents. You could borrow the
document parsing pieces from Lucene's contrib and Nutch and glue them
together into a client that speaks to Solr. Or perhaps Solr isn't
the right approach for your needs? It certainly is possible to add
these capabilities to Solr, but it would be awkward to have to
stream binary data into XML documents so that Solr could parse them
on the server side.
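
To make the division of labor concrete, here is a rough sketch of the
client side: pull plain text out of a document, wrap it in Solr's XML
<add> message, and POST it to the update handler. It is only an
illustration under assumptions. The field names "id" and "text", the
update URL, and the helper names are hypothetical and depend on your
schema and deployment. It handles HTML only, using the Python standard
library; doc, ppt and pdf would need real parsers (for example the ones
from Lucene's contrib or Nutch), and the naive extractor below does not
skip script or style content.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import urllib.request


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the plain text of an HTML string (naive: keeps script/style text)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_add_xml(fields):
    """Build a Solr <add> message for one document from a dict of plain-text fields."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # Escape both the attribute and the body so the XML stays well-formed.
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    parts.append("</doc></add>")
    return "".join(parts)


def post_to_solr(xml, url="http://localhost:8983/solr/update"):
    """POST an update message to Solr. The URL is an assumption; adjust to your install."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    page = "<html><body><h1>Hi</h1><p>a &amp; b</p></body></html>"
    # "id" and "text" are hypothetical field names; they must exist in your schema.
    message = build_add_xml({"id": "doc1", "text": html_to_text(page)})
    print(message)
    # post_to_solr(message)            # needs a running Solr instance
    # post_to_solr("<commit/>")        # make the added document searchable
```

A crawler would walk the filesystem, pick a parser by file type, and
call something like build_add_xml for each file. Solr itself only ever
sees the extracted text.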
Erik