Thread: Parallelizing Fetches




Parallelizing Fetches
user name
2006-11-05 01:43:54
Joe Gregorio wrote:
> On 11/4/06, Sam Ruby <rubysintertwingly.net> wrote:
>> Joe Gregorio wrote:
>> >
>> > Here is the branch I have been working on:
>> >
>> >   http://bitworking.org/projects/venus/branches/threaded/
>> >
>> > This branch includes httplib2 to handle the fetching. I have added a
>> > new config option 'spider_threads' that you can set to the number of
>> > threads you want to use when spidering. The default is 0. When
>> > spider_threads is set to zero httplib2 is not used and feedparser
>> > is used to fetch the feeds. Note that the threading only applies
>> > to HTTP(S) URIs, all other URI types are done in the main thread
>> > and handled by feedparser. All parsing is also handled only in the main
>> > thread.
>> >
>> > The caching in httplib2 is used and is stored as 'http' under the sources
>> > cache directory.
>> >
>> > The status of the code is 'under testing'. All of the current unit tests
>> > pass and I have run it successfully over my configurations but it
>> > definitely needs more testing, and it needs more unit tests.
>>
>> I'm seeing errors:
>>
>>      http://planet.intertwingly.net/planet.log
> 
> There are actually two types of errors going on there.
> 
> The first is one where 301 redirects would break, but only when the
> Location: URI was relative, and then only on subsequent requests, not
> the first request. Now fixed.

I'm still seeing this.  Furthermore, once such an error occurs, it
appears that that thread no longer services requests.

> The second type of problem was a failure to resolve the server
> name. Now fixed, that type of exception is now caught and logged as an
> error.

Should I be able to use IRIs if spider_threads is set to a non-zero value?

Also a new bug report: the way the change to the feed parser was made
causes it to no longer respect the default value for xml:base.

- Sam Ruby
-- 
devel mailing list
devellists.planetplanet.org

http://lists.planetplanet.org/mailman/listinfo/devel
Parallelizing Fetches
user name
2006-11-05 02:49:41
On 11/4/06, Sam Ruby <rubysintertwingly.net> wrote:
> I'm still seeing this.

You will have to clear out "sources/http"; the non-absolute
Location: URI was stored there. The fixed version of httplib2
was patched to store the right value, not to absolutize the
one retrieved from the cache.
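
To illustrate the kind of fix described above, here is a minimal sketch in
modern Python, not httplib2's actual code. It resolves a relative Location:
value against the request URI before storing it. The `cache` dict and the
`absolutize_location` helper are hypothetical names used only for this example.

```python
from urllib.parse import urljoin


def absolutize_location(request_uri, location):
    """Resolve a possibly relative Location: value against the request URI."""
    return urljoin(request_uri, location)


# Hypothetical in-memory cache: store the resolved URI, never the raw header,
# so that later requests served from the cache see an absolute redirect target.
cache = {}
request_uri = "http://example.org/feeds/old.xml"
cache[request_uri] = absolutize_location(request_uri, "/feeds/new.xml")
```

An entry cached before such a fix would still hold the relative value, which
is why the old "sources/http" contents need to be cleared.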

> Furthermore, once such an error occurs, it
> appears that that thread no longer services requests.

Yes, any uncaught exception in a thread terminates the thread.
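
A rough sketch of how a worker can survive a failed fetch, assuming a simple
queue-based design (this is not the code in the threaded branch; `worker`,
`spider`, and the `fetch` callable are hypothetical names, and only
`spider_threads` comes from the config option discussed in this thread):

```python
import logging
import queue
import threading

log = logging.getLogger("spider")


def worker(tasks, fetch, results):
    """Fetch URIs from the queue until a None sentinel arrives."""
    while True:
        uri = tasks.get()
        if uri is None:
            break
        try:
            results[uri] = fetch(uri)
        except Exception as exc:
            # Log and carry on, so one bad feed does not terminate the thread.
            log.error("Error fetching %s: %s", uri, exc)
            results[uri] = None


def spider(uris, fetch, spider_threads=2):
    """Fetch all URIs using spider_threads worker threads."""
    tasks = queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(tasks, fetch, results))
        for _ in range(spider_threads)
    ]
    for t in threads:
        t.start()
    for uri in uris:
        tasks.put(uri)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Without the try/except inside the loop, the first exception raised by `fetch`
would end that worker, and the remaining threads would have to pick up its
share of the queue.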

>
> > The second type of problem was a failure to resolve the server
> > name. Now fixed, that type of exception is now caught and logged as an
> > error.
>
> Should I be able to use IRIs if spider_threads is set to a non-zero value?

I doubt it. httplib2 only understands URIs, and I've done nothing
to enable IRIs.
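
For reference, one way a caller could convert an IRI to a URI before handing
it to a URI-only fetcher is sketched below, using only the Python standard
library. This is an assumption about how the gap might be bridged, not
something the threaded branch does; `iri_to_uri` is a hypothetical helper,
and it drops any userinfo part for simplicity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; '%' is kept so that
# escapes already present in the IRI are not encoded a second time.
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (userinfo is dropped)."""
    parts = urlsplit(iri)
    # IDNA-encode the host name; percent-encode everything else as UTF-8.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

A URI that is already plain ASCII passes through unchanged, so a conversion
like this could be applied to every feed address regardless of type.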

> Also a new bug report: the way the change to the feed parser was made
> causes it to no longer respect the default value for xml:base.

I will work on that. (Adding unit tests for these changes to both
feedparser and venus is on my list of things to do.)

   Thanks,
   -joe

-- 
Joe Gregorio        http://bitworking.org