Thread: content-type crawling problem

content-type crawling problem
user name
2006-05-29 12:43:53
Heiko Dietze wrote:
> Hello,
> 
> Eugen Kochuev wrote:
>> Btw, do I need to uncomment this? It's more logical to comment this
>> out. Right?
>>
>>
>>> <mimeType name="*">
>>>   <plugin id="parse-text" />
>>> </mimeType>
>>
>>
>>> Just uncomment this wildcard match. You might also check
>>> the other rules for further unwanted content.
> 
> Sorry for the typo, I meant that you should leave it out, yes.
> 
> Unfortunately, for the fetching of the pages this is not the solution, but
> the index will be based only on the proper content. I think the index is
> created with the parsed content.

Maybe have a look at urlfilter-suffix and only fetch those files with
suffixes you want.
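
As a rough illustration of the idea behind suffix-based URL filtering (this is not the urlfilter-suffix plugin's code, and the class name and allow-list below are hypothetical), a filter can accept a URL only when its path ends in a wanted suffix and reject it otherwise. Returning null for a rejected URL is an assumption made here to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SuffixFilterSketch {
    // Hypothetical allow-list; replace with the suffixes you actually want to fetch.
    private static final List<String> ALLOWED = Arrays.asList(".html", ".htm");

    /** Returns the URL unchanged if its path ends with an allowed suffix, otherwise null. */
    public static String filter(String url) {
        // Ignore any query string or fragment before checking the suffix.
        int cut = url.length();
        int q = url.indexOf('?');
        int h = url.indexOf('#');
        if (q >= 0) cut = Math.min(cut, q);
        if (h >= 0) cut = Math.min(cut, h);
        // Case-insensitive comparison, so ".HTM" is treated like ".htm".
        String path = url.substring(0, cut).toLowerCase(Locale.ROOT);
        for (String suffix : ALLOWED) {
            if (path.endsWith(suffix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/index.html"));
        System.out.println(filter("http://example.com/video.mpg"));
    }
}
```

With a filter of this kind applied before fetching, unwanted file types are never downloaded in the first place, rather than being fetched and then skipped at parse time.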


Regards,
 Stefan
FieldQueryFilter vs RawFieldQueryFilter
user name
2006-05-29 13:11:43
Hi,
I'm writing some plugins for Nutch and some things are killing me.
Can someone explain the difference between field and raw field?

When I use Luke, all queries work like a charm, but they return 0 results
through Nutch search.


Basically when should I have this as a query plugin:

-----
import org.apache.nutch.searcher.RawFieldQueryFilter;
public class HeadlineQueryFilter extends RawFieldQueryFilter
{
	public HeadlineQueryFilter() {
		super("headline");
	}
}
------

And when:

-------
import org.apache.nutch.searcher.FieldQueryFilter;
public class HeadlineQueryFilter extends FieldQueryFilter {
  public HeadlineQueryFilter() {
    super("headline");
  }
}

-------
???

The indexing filter is:

----
    if (headline != null) {
        // doc.add(Field.Keyword("headline", headline));
        doc.add(new Field("headline", headline, Field.Store.YES,
                          Field.Index.TOKENIZED));
        LOG.info("Headline added");
    } else {
        LOG.info("Headline not found");
    }
----
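
For background on why tokenized and untokenized fields behave differently at query time, here is a toy sketch in plain Java. It does not use Lucene or Nutch; the class, the method names and the lowercase/whitespace tokenizer are all assumptions made for illustration. It only shows that a field indexed as separate terms will not match a lookup for the whole unanalyzed string, and vice versa. It is not a diagnosis of the zero-result problem described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenizedVsRawSketch {
    // Toy analyzer (assumption): lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // A tokenized field is stored as its individual analyzed terms.
    static Set<String> indexTokenized(String value) {
        return new HashSet<>(tokenize(value));
    }

    // An untokenized field is stored as one term: the exact original string.
    static Set<String> indexUntokenized(String value) {
        return Collections.singleton(value);
    }

    // Analyzed lookup: the query is tokenized and every term must be present.
    static boolean matchesAnalyzed(Set<String> terms, String query) {
        return terms.containsAll(tokenize(query));
    }

    // Raw lookup: the query string is used as a single term, unchanged.
    static boolean matchesRaw(Set<String> terms, String query) {
        return terms.contains(query);
    }

    public static void main(String[] args) {
        String headline = "Nutch Release Announced";
        Set<String> tokenized = indexTokenized(headline);
        Set<String> untokenized = indexUntokenized(headline);
        System.out.println(matchesAnalyzed(tokenized, "release"));
        System.out.println(matchesRaw(tokenized, headline));
        System.out.println(matchesRaw(untokenized, headline));
    }
}
```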

Thanks in advance
Bogdan
