Thread: Parallelizing Fetches




Parallelizing Fetches
2006-11-03 17:26:34
On 11/3/06, Harry Fuecks <hfuecksgmail.com> wrote:
> > Joe Gregorio is looking into parallelizing the HTTP fetches; taking a
> > single feed and processing it multiple times is also an obvious
> > optimization.  But for now, yes, that would mean multiple fetches.
>
> I guess Joe is already a long way into this but perhaps this is
> interesting anyway;
>
> http://blog.bitflux.ch/archive/2005/10/28/how-to-fetch-a-lot-of-feeds.html

Here is the branch I have been working on:

   http://bitworking.org/projects/venus/branches/threaded/

This branch includes httplib2 to handle the fetching. I have added a
new config option 'spider_threads' that you can set to the number of
threads you want to use when spidering. The default is 0. When
spider_threads is set to zero, httplib2 is not used and feedparser
is used to fetch the feeds. Note that the threading only applies
to HTTP(S) URIs; all other URI types are fetched in the main thread
and handled by feedparser. All parsing is also handled only in the
main thread.
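
A rough sketch of that split, for illustration only. This is not the code
in the branch: the function names and the injected `fetch` and `parse`
callables are made up here, and it is written for Python 3. HTTP(S) URIs
are fetched by worker threads, everything else is fetched in the calling
thread, and all parsing happens in the calling thread.

```python
import queue
import threading
from urllib.parse import urlparse


def is_http(uri):
    """True for URIs that worker threads are allowed to fetch."""
    return urlparse(uri).scheme in ("http", "https")


def spider(uris, fetch, parse, spider_threads=0):
    """Return {uri: parsed result or exception}.

    `fetch` and `parse` are hypothetical callables supplied by the caller.
    With spider_threads == 0 everything runs in the calling thread.
    """
    threaded = [u for u in uris if spider_threads > 0 and is_http(u)]
    threaded_set = set(threaded)
    pending = queue.Queue()   # URIs waiting for a worker
    fetched = queue.Queue()   # (uri, body, error) tuples from the workers
    for uri in threaded:
        pending.put(uri)

    def worker():
        while True:
            try:
                uri = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to fetch
            try:
                fetched.put((uri, fetch(uri), None))
            except Exception as exc:  # report the error to the main thread
                fetched.put((uri, None, exc))

    workers = [threading.Thread(target=worker) for _ in range(spider_threads)]
    for t in workers:
        t.start()

    results = {}
    # Non-HTTP(S) URIs are fetched and parsed here, in the calling thread.
    for uri in uris:
        if uri not in threaded_set:
            results[uri] = parse(fetch(uri))

    # Bodies fetched by the workers are also parsed only in the calling thread.
    for _ in threaded:
        uri, body, exc = fetched.get()
        results[uri] = exc if exc is not None else parse(body)

    for t in workers:
        t.join()
    return results
```

Passing the fetcher in as a parameter keeps the sketch independent of
httplib2 and feedparser, so it can be tried without touching the network.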

The caching in httplib2 is used, and the cache is stored as 'http' under
the sources cache directory.

The status of the code is 'under testing'. All of the current unit tests
pass and I have run it successfully over my configurations, but it
definitely needs more testing, and it needs more unit tests.

Some preliminary numbers for ~60 feeds:
   config.spider_threads = 0      1m40s
   config.spider_threads = 10     30s

I did run a special test where I removed line 375 of planet/spider.py,
where the call to spiderFeed() is skipped if the response from httplib2
is from the cache. With that line removed, forcing a call to spiderFeed()
for each and every feed, the timing is:

   config.spider_threads = 0     1m40s
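
For readers without the branch checked out, the cache check described
above amounts to something like the following. This is only a sketch, not
line 375 itself: `needs_processing` and its `force` flag are invented
names, and the only thing assumed about the response is the boolean
`fromcache` attribute that httplib2 sets on its Response objects.

```python
def needs_processing(response, force=False):
    """Decide whether a fetched feed should be passed on for processing.

    `response` is any object with a boolean `fromcache` attribute.
    `force` is a made-up flag standing in for "remove the check".
    """
    if force:
        return True
    # A response served from the cache means the feed has not changed,
    # so the processing step can be skipped.
    return not getattr(response, "fromcache", False)
```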

Of course, this is just the beginning for multi-threaded spidering. What
should be added after this is stable is some code to make the spidering
'nice'. That is, the code should look up IP addresses and avoid hitting
the same server more than once every N seconds.
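
One possible shape for that, as a sketch only. Nothing like this exists in
the branch yet: `HostThrottle` and its method names are hypothetical, and
the resolver and clock are injectable so the logic can be exercised
without DNS lookups or real waiting.

```python
import socket
import threading
import time


class HostThrottle:
    """Remember when each server address was last hit (hypothetical helper)."""

    def __init__(self, min_interval, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.min_interval = min_interval  # the "N seconds" between hits
        self.resolve = resolve            # host name -> IP address
        self.clock = clock
        self.last_hit = {}                # IP address -> time of last fetch
        self.lock = threading.Lock()      # spider threads share this object

    def delay_for(self, host):
        """Seconds to wait before `host` may be fetched again (0.0 if none)."""
        address = self.resolve(host)
        with self.lock:
            last = self.last_hit.get(address)
            if last is None:
                return 0.0
            return max(0.0, self.min_interval - (self.clock() - last))

    def record(self, host):
        """Note that `host` was fetched just now."""
        address = self.resolve(host)
        with self.lock:
            self.last_hit[address] = self.clock()
```

Keying on the resolved address rather than the host name means two feeds
served from the same machine under different names share one timer.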

   Thanks,
   -joe

-- 
Joe Gregorio        http://bitworking.org
-- 
devel mailing list
devellists.planetplanet.org

http://lists.planetplanet.org/mailman/listinfo/devel
Parallelizing Fetches
2006-11-04 15:12:55
Joe Gregorio wrote:
> 
> Here is the branch I have been working on:
> 
>   http://bitworking.org/projects/venus/branches/threaded/
> 
> This branch includes httplib2 to handle the fetching. I have added a
> new config option 'spider_threads' that you can set to the number of
> threads you want to use when spidering. The default is 0. When
> spider_threads is set to zero httplib2 is not used and feedparser
> is used to fetch the feeds. Note that the threading only applies
> to HTTP(S) URIs, all other URI types are done in the main thread
> and handled by feedparser. All parsing is also handled only in the main
> thread.
>
> The caching in httplib2 is used and is stored as 'http' under the sources
> cache directory.
>
> The status of the code is 'under testing'. All of the current unit tests
> pass and I have run it successfully over my configurations but it
> definitely needs more testing, and it needs more unit tests.

I'm seeing errors:

     http://planet.intertwingly.net/planet.log

- Sam Ruby

P.S.  BeautifulSoup, compat_logging, feedparser, htmltmpl, portalocker,
and timeoutsocket are all in the planet directory.  Either httplib2
should be moved into the planet directory, or all these should be moved
to the same place.
Parallelizing Fetches
2006-11-03 23:38:20
<quote who="Joe Gregorio">

> Of course, this is just the beginning for multi-threaded spidering.

Rock! This is a long time desired feature.  Thank you!

- Jeff

-- 
linux.conf.au 2007: Sydney, Australia           http://lca2007.linux.org.au/
Parallelizing Fetches
2006-11-04 22:25:10
On 11/4/06, Sam Ruby <rubysintertwingly.net> wrote:
> Joe Gregorio wrote:
> >
> > Here is the branch I have been working on:
> >
> >   http://bitworking.org/projects/venus/branches/threaded/
> >
> > This branch includes httplib2 to handle the fetching. I have added a
> > new config option 'spider_threads' that you can set to the number of
> > threads you want to use when spidering. The default is 0. When
> > spider_threads is set to zero httplib2 is not used and feedparser
> > is used to fetch the feeds. Note that the threading only applies
> > to HTTP(S) URIs, all other URI types are done in the main thread
> > and handled by feedparser. All parsing is also handled only in the main
> > thread.
> >
> > The caching in httplib2 is used and is stored as 'http' under the sources
> > cache directory.
> >
> > The status of the code is 'under testing'. All of the current unit tests
> > pass and I have run it successfully over my configurations but it
> > definitely needs more testing, and it needs more unit tests.
>
> I'm seeing errors:
>
>      http://planet.intertwingly.net/planet.log

There are actually two types of errors going on there.

The first is one where 301 redirects would break, but only when the
Location: URI was relative, and then only on subsequent requests, not
the first request. Now fixed.
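
For context, a relative Location: value has to be resolved against the
URI that was requested before it can be followed. The snippet below shows
that general step using the standard library. It is not the httplib2 fix
itself, which is not shown in this thread; `resolve_location` and the
example.org URIs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_location(request_uri, location):
    """Turn a possibly relative Location: value into an absolute URI."""
    # urljoin leaves an absolute `location` unchanged and resolves a
    # relative one against `request_uri`.
    return urljoin(request_uri, location)


# A server-relative path is resolved against the host of the request.
moved = resolve_location("http://example.org/feeds/atom.xml", "/new/atom.xml")
```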

The second type of problem was a failure to resolve the server name.
Now fixed; that type of exception is now caught and logged as an error.
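
The pattern is roughly the following. This is a sketch rather than the
code in the branch: `fetch_or_log`, the logger name, and the injected
`fetch` callable are hypothetical. It assumes the failure surfaces as
socket.gaierror, which is what the standard library raises for a failed
lookup; a library may wrap it in its own exception type instead.

```python
import logging
import socket

log = logging.getLogger("planet.sketch")  # hypothetical logger name


def fetch_or_log(uri, fetch):
    """Call fetch(uri); log a name-resolution failure instead of crashing.

    Returns the fetch result, or None if the server name did not resolve.
    """
    try:
        return fetch(uri)
    except socket.gaierror as exc:
        log.error("Unable to resolve server name for %s: %s", uri, exc)
        return None
```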

>
> - Sam Ruby
>
> P.S.  BeautifulSoup, compat_logging, feedparser, htmltmpl, portalocker,
> and timeoutsocket are all in the planet directory.  Either httplib2
> should be moved into the planet directory, or all these should be moved
> to the same place.
>

That too has been fixed: httplib2 has been moved into the planet directory.

  Thanks,
  -joe

-- 
Joe Gregorio        http://bitworking.org