Thread: Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?




Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?
user name
2007-09-23 10:38:17
(Copied from nutch-user, this is more a dev topic now)
> It's not an issue with readseg or readlinkdb themselves, because a
> segment fetched in the older nutch (using the exact same
> configuration) expels png links in trunk's readlinkdb. It appears
> the fetcher now only parses URLs that pass the filters into the
> segment.


I checked the diffs from my old version (mid-December 06) and trunk
ParseOutputFormat. It appears now that the parse puts the outlink
URLs through the URLFilters. I confirmed this by taking out .png from
my URLFilters and re-running a crawl -- pngs now appear in the
readlinkdb.

1) Was it a bug that URLs that would not pass URLFilters got into the
linkdb for analysis?

2) If so, why is there a -noFilter option for readlinkdb? The linkdb
has already been filtered whether you like it or not. -noFilter will
never have any effect.

There needs to be a way to have the linkdb reflect all URLs
(unfiltered) for further analysis. I suggest a -noFilterOutlinks
option (default off) in the fetch command, since the default behavior
of fetch is to parse. This would simply not call the filter in
ParseOutputFormat, if my theory is correct.
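
To make the suggestion concrete, here is a minimal sketch of the idea, not the actual ParseOutputFormat code. Every name in it (OutlinkFilterSketch, selectOutlinks, REJECT_PNG, the noFilterOutlinks flag) is hypothetical, and it assumes the filter chain returns null for a rejected URL:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names are taken from the Nutch source.
final class OutlinkFilterSketch {

    // Stand-in for a URL filter chain: returns the URL if it is accepted,
    // or null if it is rejected (here, anything ending in .png).
    static final UnaryOperator<String> REJECT_PNG =
        url -> url.toLowerCase().endsWith(".png") ? null : url;

    // Returns the outlinks that would be written to the segment.
    // When noFilterOutlinks is true, the filter is never called.
    static List<String> selectOutlinks(List<String> outlinks,
                                       UnaryOperator<String> filter,
                                       boolean noFilterOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String result = noFilterOutlinks ? url : filter.apply(url);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/a.html",
            "http://example.com/logo.png");
        // Flag off (proposed default): the .png outlink is dropped.
        System.out.println(selectOutlinks(outlinks, REJECT_PNG, false));
        // Flag on: both outlinks survive for later linkdb analysis.
        System.out.println(selectOutlinks(outlinks, REJECT_PNG, true));
    }
}
```

With the flag off, only URLs that pass the filter reach the segment, which matches what I am seeing in trunk; with it on, the linkdb would see every outlink.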





Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?
user name
2007-09-23 10:43:44
On Sep 23, 2007, at 11:38 AM, Brian Whitman wrote:
>
> 2) If so, why is there a -noFilter option for readlinkdb?
>

That was a mistake; change this to:

> 2) If so, why is there a -noFilter option for invertlinks?


