Matt,
(I didn't notice your message until this morning...)
I implemented this as well, but I am using Nutch 0.8 so that the generated
Lucene index version matches what our front-end searcher is using. I
implemented this in a manner similar to scoring. The first step was to add
a crawl depth field to CrawlDatum. When I crawl a page I insert the current
page depth into the content metadata; then in ParseOutputFormat I inspect
the value from the content metadata and either choose not to crawl the links
on this page or update the crawl depth for each crawl datum.
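
Here is a rough sketch of that depth check, in plain Java rather than
against the real Nutch classes. The class name, the "crawl.depth" metadata
key and the method signature are all made up for illustration, and treating
a missing depth as 0 (for freshly injected URLs) is my assumption, not
something the actual patch is known to do.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: it models only the depth bookkeeping.
class DepthLimitSketch {

    // Metadata key under which the fetch step records the page depth (assumed name).
    static final String DEPTH_KEY = "crawl.depth";

    /**
     * Decides which outlinks of a parsed page should be passed on to the crawl db.
     * Each returned URL is mapped to its depth (parent depth + 1). The result is
     * empty when the parent page already sits at maxDepth, so its links are dropped.
     */
    static Map<String, Integer> outlinksToFollow(Map<String, String> contentMetadata,
                                                 List<String> outlinks,
                                                 int maxDepth) {
        // Assumption: an injected seed URL carries no depth yet, so default to 0.
        int parentDepth = Integer.parseInt(contentMetadata.getOrDefault(DEPTH_KEY, "0"));

        Map<String, Integer> toFollow = new LinkedHashMap<>();
        if (parentDepth >= maxDepth) {
            return toFollow; // depth limit reached: do not crawl links on this page
        }
        for (String url : outlinks) {
            toFollow.put(url, parentDepth + 1); // child is one level deeper than its parent
        }
        return toFollow;
    }
}
```

With a limit of 3, a page recorded at depth 2 hands its links on at depth 3,
while a page already at depth 3 contributes no links at all.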
I am planning to submit this back to the Nutch community as a patch once we
have it tested a bit more. I am really new to the Nutch codebase so there
may be a better way to do it. Any advice, suggestions or comments would be
awesome.
Cheers,
Vince
On 8/16/07, Matt Kangas <kangas gmail.com> wrote:
>
> Vince, I have implemented crawl-depth limits in Nutch 0.7. Because
> there was no crawldb metadata support (yet), I had to store crawl-depth
> in a custom db (mapfile), and add a processing step to the crawl cycle.
>
> We had this working for ~30k base URLs and depth limited to 4-20
> links (min 4, extended when our feature-detector found relevant
> content).
>
> I have a port of this code to Nutch 0.9, not written by me and not
> stress-tested. If you're interested, I can see if it's possible to
> release the source, and adapt it as a patch to the latest Nutch
> codebase.
>
> (AFAIK, this functionality is essential for any vertical search
> engine. Which is to say, any search startup that wants to succeed by
> not attacking Google head-on, IMHO.)
>
> --Matt Kangas
>
>
> On Aug 14, 2007, at 11:47 AM, Vince Filby wrote:
>
> > I have a list of, say, 8 million URLs that I will need to crawl with
> > Nutch, and I will also need to freshen these URLs on a regular basis
> > (I will not be following external links though). Since I have so many
> > URLs I would like to crawl breadth first and restrict the depth to,
> > say, 3 or 4 levels. I also want to be able to inject new URLs at any
> > time and have Nutch automagically start crawling to the appropriate
> > depth. In the intranet recrawl script, the depth is represented by a
> > new segment with all the available links from the previous segment.
> > With the large number of pages I will be crawling, I would like to
> > restrict the segment size to something that can be crawled in a few
> > hours so I can constantly maintain a fresh index.
> >
> > How can I control depth with a much larger crawl, especially when
> > there will be brand new URLs thrown into the mix later on?
> >
> > Any advice on this topic would be greatly appreciated,
> > Vince
>
> --
> Matt Kangas / kangas gmail.com
>
>
>