Thread: Indexing documents




Indexing documents
United States
2007-10-19 08:51:06
Hi,

My questions are:

1. Can Nutch index PDF, HTML, and XML documents?

2. Can Nutch index remote documents?

Thanks
-- 
View this message in context: http://www.nabble.com/Indexing-documents-tf4653264.html#a13294769
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Indexing documents
United States
2007-10-19 09:02:27


payo wrote:
> 
> Hi,
> 
> My questions are:
> 
> 1. Can Nutch index PDF, HTML, and XML documents?
> 
> 2. Can Nutch index remote documents?
> 
> Thanks
> 

Yes to both questions. For the first one, Nutch already comes with
the plugins needed to index those file types.



Re: Indexing documents
United States
2007-10-19 09:16:20


Goethe wrote:
> 
> 
> 
> payo wrote:
>> 
>> Hi,
>> 
>> My questions are:
>> 
>> 1. Can Nutch index PDF, HTML, and XML documents?
>> 
>> 2. Can Nutch index remote documents?
>> 
>> Thanks
>> 
> 
> Yes to both questions. For the first one, Nutch already comes with
> the plugins needed to index those file types.
> 
> 

Where can I find more information on this?



Re: Indexing documents
United States
2007-10-19 15:22:57
Where are you from, Sergio?




Sergio Morales wrote:
> 
> Hi Payo,
> 
> You need to add the right plugins to your Nutch configuration file. Here is
> an extract from my installation:
> 
> NUTCH_HOME/conf/nutch-site.xml:
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>  <property>
>    <name>plugin.includes</name>
>    <value>nutch-extensionpoints|ontology|protocol-ftp|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|rtf|msword|js|mspowerpoint|msexcel|oo|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-lucene|scoring-opic</value>
>  </property>
> ...
> 
> Using the above configuration, I am able to index text, HTML, PDF, Excel,
> etc.
> 
> Not sure about XML; I think there is already an enhancement request for
> this in JIRA.
> 
> I hope this helps,
> 
> Sergio
> 


