Thread: Depth restriction on large crawls




Depth restriction on large crawls
2007-08-14 10:47:18
I have a list of, say, 8 million URLs that I will need to crawl with
Nutch, and I will also need to freshen these URLs on a regular basis (I
will not be following external links, though).  Since I have so many
URLs, I would like to crawl breadth first and restrict the depth to,
say, 3 or 4 levels.  I also want to be able to inject new URLs at any
time and have Nutch automagically start crawling them to the
appropriate depth.  In the intranet recrawl script, each level of depth
is represented by a new segment containing all the available links from
the previous segment.  With the large number of pages I will be
crawling, I would like to restrict the segment size to something that
can be crawled in a few hours, so I can constantly maintain a fresh
index.

How can I control depth with a much larger crawl, especially when there
will be brand new URLs thrown into the mix later on?

Any advice on this topic would be greatly appreciated,
Vince
Re: Depth restriction on large crawls
United States
2007-08-16 17:45:00
Vince, I have implemented crawl-depth limits in Nutch 0.7. Because
there was no crawldb metadata support (yet), I had to store crawl depth
in a custom db (mapfile), and add a processing step to the crawl cycle.
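
For illustration, the side-table idea described above might look roughly
like the following. This is only a sketch in plain Python, not the Nutch
0.7 code: the function name `update_depths`, the dict standing in for
the custom db, and the `max_depth` parameter are all assumptions made
for the example.

```python
# Hypothetical sketch: crawl depth kept in a side table keyed by URL and
# updated by an extra step after each fetch round. A plain dict stands in
# for the custom db; none of these names come from Nutch.

def update_depths(depth_db, fetched, max_depth=4):
    """Record depths for newly discovered links.

    depth_db: dict mapping URL -> depth (seeds are depth 0)
    fetched:  dict mapping each fetched URL -> list of its outlinks
    Returns the newly discovered URLs that are still within max_depth.
    """
    newly_allowed = []
    for url, outlinks in fetched.items():
        # A fetched URL with no recorded depth is treated as a fresh seed.
        child_depth = depth_db.setdefault(url, 0) + 1
        if child_depth > max_depth:
            continue  # outlinks would exceed the limit, so drop them
        for link in outlinks:
            known = depth_db.get(link)
            if known is None:
                depth_db[link] = child_depth
                newly_allowed.append(link)
            elif child_depth < known:
                # Keep the shortest known path to a page.
                depth_db[link] = child_depth
    return newly_allowed
```

Because an unknown fetched URL defaults to depth 0, a URL injected later
starts its own depth count from zero in this sketch.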

We had this working for ~30k base URLs, with depth limited to 4-20
links (minimum 4, extended when our feature detector found relevant
content).
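
One way such an extendable limit could work is sketched below. The
message does not say how the extension was actually computed, so the
rule used here (a relevant page lets its outlinks go `extension` levels
deeper, capped at 20) is purely an assumption, as are the names
`child_limit`, `MIN_DEPTH`, `MAX_DEPTH`, and `extension`.

```python
# Hypothetical sketch of a depth limit that starts at 4 and can grow to
# 20 when a page is judged relevant. The extension rule is an assumption.

MIN_DEPTH = 4    # base limit applied to every seed
MAX_DEPTH = 20   # hard cap, never exceeded

def child_limit(parent_limit, parent_depth, is_relevant, extension=2):
    """Return the depth limit inherited by a page's outlinks,
    or None if the outlinks should not be followed at all."""
    limit = parent_limit
    if is_relevant:
        # Relevant content buys a few more levels, up to the hard cap.
        limit = min(MAX_DEPTH, max(limit, parent_depth + extension))
    if parent_depth + 1 > limit:
        return None
    return limit
```

With these assumed values, an irrelevant page at depth 4 stops the
crawl on that branch, while a relevant page at depth 4 raises the limit
for its outlinks to 6.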

I have a port of this code to Nutch 0.9, not written by me and not
stress-tested. If you're interested, I can see if it's possible to
release the source and adapt it as a patch to the latest Nutch
codebase.

(AFAIK, this functionality is essential for any vertical search engine.
Which is to say, any search startup that wants to succeed by not
attacking Google head-on, IMHO.)

--Matt Kangas



--
Matt Kangas / kangasgmail.com



Re: Depth restriction on large crawls
2007-08-21 09:25:58
Matt,

(I didn't notice your message until this morning...)

I implemented this as well, but I am using Nutch 0.8 so that the
generated Lucene index version matches what our front-end searcher is
using.  I implemented this in a manner similar to scoring.  The first
step was to add a crawl depth field to CrawlDatum.  When I crawl a
page, I insert the current page depth into the content metadata; then,
in ParseOutputFormat, I inspect the value from the content metadata and
either choose not to crawl the links on this page or update the crawl
depth for each outlink's crawl datum.
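
The flow described above can be sketched as follows. This is a
self-contained Python illustration, not the Nutch 0.8 patch: `Datum` and
`Content` are stand-ins for Nutch's CrawlDatum and content objects, and
`DEPTH_KEY`, `fetch`, and `emit_outlinks` are hypothetical names chosen
for the example.

```python
# Hypothetical sketch: depth travels with each crawl record, is copied
# into the page's metadata at fetch time, and is checked when outlinks
# are written out after parsing.
from dataclasses import dataclass, field

DEPTH_KEY = "crawl.depth"  # assumed metadata key, not a Nutch constant

@dataclass
class Datum:
    """Stand-in for a crawl record; injected URLs start at depth 0."""
    url: str
    depth: int = 0

@dataclass
class Content:
    """Stand-in for fetched content plus its metadata."""
    url: str
    outlinks: list
    metadata: dict = field(default_factory=dict)

def fetch(datum, outlinks):
    # Simulated fetch: carry the page's depth along in the metadata.
    return Content(datum.url, list(outlinks), {DEPTH_KEY: datum.depth})

def emit_outlinks(content, max_depth):
    # Either drop all links on this page or pass them on one level deeper.
    depth = content.metadata.get(DEPTH_KEY, 0)
    if depth >= max_depth:
        return []
    return [Datum(link, depth + 1) for link in content.outlinks]
```

Since every new record defaults to depth 0, URLs injected later would
be crawled to the same relative depth as the original seeds under this
scheme.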

I am planning to submit this back to the Nutch community as a patch
once we have it tested a bit more.  I am really new to the Nutch
codebase, so there may be a better way to do it.  Any advice,
suggestions or comments would be awesome.

Cheers,
Vince


