
Thread: document support for file system crawling




document support for file system crawling
user name
2006-08-30 06:42:39
Hi there,

browsing through the message thread I tried to find a trail addressing
file system crawls. I want to implement an enterprise search over a
networked filesystem, crawling all sorts of documents, such as html,
doc, ppt and pdf. Nutch provides plugins enabling it to read proprietary
formats. Is there support for the same functionality in Solr?

Bruno
-- 
View this message in context: http://www.nabble.com/document-support-for-file-system-crawling-tf2188066.html#a6053318
Sent from the Solr - User forum at Nabble.com.

document support for file system crawling
user name
2006-08-30 10:03:03
On Aug 30, 2006, at 2:42 AM, Bruno wrote:
> browsing through the message thread I tried to find a trail addressing
> file system crawls. I want to implement an enterprise search over a
> networked filesystem, crawling all sorts of documents, such as html,
> doc, ppt and pdf. Nutch provides plugins enabling it to read
> proprietary formats. Is there support for the same functionality in
> solr?

No.  Solr is strictly a search server that takes plain text for the
fields of documents added to it.  The client is responsible for parsing
the text out of these types of documents.  You could borrow the
document parsing pieces from Lucene's contrib and Nutch and glue them
together into your client that speaks to Solr, or perhaps Solr isn't
the right approach for your needs?  It certainly is possible to add
these capabilities into Solr, but it would be awkward to have to
stream binary data into XML documents such that Solr could parse them
on the server side.

	Erik
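
The client-side approach described above can be sketched roughly as
follows. This is only an illustration, not code from the thread: it uses
Python's standard library in place of the Lucene contrib and Nutch
parsers, handles HTML only (doc, ppt and pdf would need separate parsers
that are not shown), and assumes a schema with "path", "title" and
"content" fields. The helper names and the update URL are hypothetical.

```python
import os
import urllib.request
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TextExtractor(HTMLParser):
    """Collects the title and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.parts = [], []
        self._in_title = False
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        (self.title_parts if self._in_title else self.parts).append(data)


def extract_html(markup):
    """Return (title, content) as whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    content = " ".join(" ".join(parser.parts).split())
    return title, content


def build_add_xml(docs):
    """Build a Solr <add> message from dicts of field name -> plain text."""
    out = ["<add>"]
    for doc in docs:
        out.append("<doc>")
        for name, value in doc.items():
            out.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
        out.append("</doc>")
    out.append("</add>")
    return "".join(out)


def crawl(root):
    """Yield one document dict per HTML file found under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    title, content = extract_html(fh.read())
                yield {"path": path, "title": title, "content": content}


def post_to_solr(xml, url="http://localhost:8983/solr/update"):
    """POST an XML message to Solr; the URL is an assumed local install."""
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A crawl would then look something like
post_to_solr(build_add_xml(crawl("/some/share"))) followed by
post_to_solr("<commit/>"). Only plain text ever reaches Solr; all of the
parsing stays in the client.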


document support for file system crawling
user name
2006-08-30 17:20:29
: the text out of these types of documents.  You could borrow the
: document parsing pieces from Lucene's contrib and Nutch and glue them
: together into your client that speaks to Solr, or perhaps Solr isn't
: the right approach for your needs?   It certainly is possible to add
: these capabilities into Solr, but it would be awkward to have to
: stream binary data into XML documents such that Solr could parse them
: on the server side.

Agreed.  Solr's focus is on indexing "Structured Data".  The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data.  If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index
then Nutch is probably your best bet.


-Hoss
