
Thread: Fw: Indexer does not update the field "TITLE" of Lucene when processing specific html documents




United States
2007-10-19 02:28:42

Hi,

 

I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.


It seems the indexer is unable to update the field "TITLE" of the Lucene index when processing specific html documents.

 

 

Please find below a brief summary of this issue:

 

1.- I extracted the new version into a separate directory and copied across the following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt

 

2.- To reproduce the issue, you need to copy the attached HTML document to your web server/filesystem.

 

3.- Run the crawl using the following command.
./nutch crawl urls -dir crawl -depth 22


4.- Open the index using Luke.

 

5.- Select the "document" tab and move through the docs until you find the above document.
You will see that the TITLE field is empty --> INCORRECT, because this HTML document contains a title.

 

6.- Now, open the html document, add a space anywhere then save it again.

 

7.- Repeat steps 3 and 4.


You will notice that this time the "TITLE" field contains the correct information.
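
For anyone reproducing this, the source-side checks in steps 5 and 6 can be sketched in Python. This is only a minimal sketch: it confirms that the HTML file itself has a non-empty title and then changes the file by appending one space. The names `TitleParser`, `extract_title`, `title_of_file` and `append_space` are hypothetical helpers written for this illustration; they are not part of Nutch, Lucene or Luke, and the script does not read the index.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


def title_of_file(path):
    """Read an HTML file (assumed UTF-8 here; bad bytes are replaced) and return its title."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return extract_title(text)


def append_space(path):
    """Step 6: modify the file by appending a single space byte."""
    with open(path, "ab") as handle:
        handle.write(b" ")
```

Calling `title_of_file` on the attached document before and after `append_space` should return the same non-empty title, which would show that only the file content changed between the two crawls and not the title itself.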

 

This problem does NOT occur with NUTCH 9.0.

 

Please advise,

 

Many thanks in advance for your support

 

Serg


