Hi Payo,
You need to add the right plugins to your Nutch configuration
file. Here is an extract from my installation:
NUTCH_HOME\conf\nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|ontology|protocol-ftp|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|rtf|msword|js|mspowerpoint|msexcel|oo|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-lucene|scoring-opic</value>
</property>
...
Using the above configuration, I am able to index text,
HTML, PDF, Excel, etc.
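
If it helps to see which plugins a given plugin.includes value would pick up, here is a minimal sketch in Python. It assumes the value is treated as a regular expression that must match a plugin's whole id, which is how I understand Nutch to use it. The helper name is_included is made up for illustration, and "parse-xml" is a hypothetical plugin id used only to show a non-match.

```python
import re

# The plugin.includes value from the nutch-site.xml extract above,
# joined back into a single regular expression.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|ontology|protocol-ftp|protocol-httpclient|"
    "urlfilter-regex|parse-(text|html|pdf|rtf|msword|js|mspowerpoint|"
    "msexcel|oo|rss)|index-(basic|more)|query-(basic|site|url|more)|"
    "summary-lucene|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the pattern.

    Assumption: whole-string matching, mirroring what I believe Nutch does.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # "parse-xml" is a hypothetical id, not a plugin I know to exist.
    for plugin_id in ["parse-pdf", "parse-msexcel", "parse-xml", "protocol-http"]:
        status = "included" if is_included(plugin_id) else "not included"
        print(f"{plugin_id}: {status}")
```

With this value, ids such as parse-pdf and parse-msexcel match, while anything not listed inside the parse-(...) group does not, so a new document type would need its plugin added to that group.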
I am not sure about XML; I think there is already an enhancement
request for this in JIRA.
I hope this helps,
Sergio
----- Original Message ----
From: payo <payo22 yahoo.com>
To: nutch-user lucene.apache.org
Sent: Friday, 19 October, 2007 4:16:20 PM
Subject: Re: Indexing documents
Goethe wrote:
>
>
>
> payo wrote:
>>
>> Hi
>>
>> my questions are
>>
>> 1.- Can Nutch index PDF, HTML and XML documents?
>>
>> 2.- Can Nutch index remote documents?
>>
>> thanks
>>
>
> Yes to both questions, and for the first question Nutch already comes with
> the plugins necessary to index those file types.
>
>
Where can I obtain information on this?
--
View this message in context: http://www.nabble.com/Indexing-documents-tf4653264.html#a13295436
Sent from the Nutch - User mailing list archive at
Nabble.com.