Thread: Parallelizing Fetches

Parallelizing Fetches
2006-11-05 10:31:31
Joe Gregorio wrote:
> On 11/4/06, Sam Ruby <rubysintertwingly.net> wrote:
>> I'm still seeing this.
> 
> You will have to clear out "sources/http", the non-absolute
> location: uri was stored there. The fixed version of httplib2
> was patched to store the right value, not to
> absolutize the one retrieved from the cache.

Cool.

>> Furthermore, once such an error occurs, it
>> appears that that thread no longer services requests.
> 
> Yes, any uncaught exceptions in a thread terminate the thread.

What puzzles me is that the logic that I see appears to take great care
not to have any uncaught exceptions: both _spider_proc and the code that
processes the work_queue catch Exception, log it, and should (to my
reading) continue with the loop.
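
The loop shape described above can be sketched roughly as follows. This is an illustration in current Python, not the Venus code: worker, handle, and results are hypothetical names, and the None sentinel is an assumption about how the thread is told to stop. Anything raised that does not derive from Exception would still escape the handler and end the thread.

```python
import logging
import queue
import threading

log = logging.getLogger("spider")  # hypothetical logger name


def worker(work_queue, handle, results):
    """Process items until a None sentinel arrives, surviving per-item errors."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel (an assumption): stop the thread cleanly
            work_queue.task_done()
            break
        try:
            results.append(handle(item))
        except Exception:
            # Log the failure and keep looping so the thread stays alive.
            log.exception("error while processing %r", item)
        finally:
            work_queue.task_done()
```

With this shape a single bad item is logged and skipped, and the thread goes on to service the next request.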

>> > The second type of problem was a failure to resolve the server
>> > name. Now fixed, that type of exception is now caught and logged as an
>> > error.
>>
>> Should I be able to use IRIs if spider_threads is set to a non-zero
>> value?
> 
> I would doubt it, httplib2 only understands URIs, I've done nothing
> to enable IRIs.

All it takes is code like the following.  If done within httplib2, every
user of that library would benefit:

     # iri support
     try:
         if isinstance(url,unicode):
             url = url.encode('idna')
         else:
             url = url.decode('utf-8').encode('idna')
     except:
         pass

The above should be safe.  The Python libraries are smart enough to only
operate on the host portion of the URI.  If the host portion of the URI
does not have any high bit characters, nothing is done.  Also if the
input url is not valid utf-8 (a requirement for IRIs), then again,
nothing is done.

>> Also a new bug report: the way the change to the feed parser was made
>> causes it to no longer respect the default value for xml:base.
> 
> I will work on that. (Adding unit tests for these changes to both
> feedparser and venus are on my list of things to do).

It occurs to me that no change to the feed parser is required.  The feed
parser is set up to handle arbitrary "file like" objects - gotta love
duck typing.  Here's a rough sketch:

     data = StringIO(content)
     setattr(data,'url',feed)
     setattr(data,'headers',resp_headers)
     feedparser.parse(data)

Of course, if httplib2 takes care of unzipping and deflating, headers
like content-encoding may need to be removed lest the feedparser tries
to unzip the results.
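
A slightly fuller sketch of the same idea in current Python, for illustration only. FetchedFeed and wrap_response are hypothetical names; the url, headers, and status attributes are the ones discussed in this thread, and dropping content-encoding assumes the HTTP library has already decompressed the body.

```python
import io


class FetchedFeed(io.BytesIO):
    """In-memory response body that also carries response metadata."""


def wrap_response(content, url, headers, status=None):
    """Wrap an already-decoded body as a file-like object with metadata."""
    data = FetchedFeed(content)
    data.url = url
    # The body is assumed to be decompressed already, so drop the header
    # that would otherwise invite a second attempt at unzipping it.
    data.headers = {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() != "content-encoding"
    }
    if status is not None:
        data.status = status
    return data
```

The resulting object can be handed to anything that only needs read() plus those attributes, for example feedparser.parse(data) if feedparser is installed.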

Another thing to care about in the case of redirects, is that the url
property should be set to the value of the location header.

Again, it seems to me that such logic may benefit other users of httplib2.

- Sam Ruby
-- 
devel mailing list
devellists.planetplanet.org

http://lists.planetplanet.org/mailman/listinfo/devel
Parallelizing Fetches
2006-11-06 04:24:37
On 11/5/06, Sam Ruby <rubysintertwingly.net> wrote:
> All it takes is code like the following.  If done within httplib2, every
> user of that library would benefit:
>
>      # iri support
>      try:
>          if isinstance(url,unicode):
>              url = url.encode('idna')
>          else:
>              url = url.decode('utf-8').encode('idna')
>      except:
>          pass

Excellent, I will add this to httplib2 soon.

> It occurs to me that no change to the feed parser is required.  The feed
> parser is set up to handle arbitrary "file like" objects - gotta love
> duck typing.  Here's a rough sketch:
>
>      data = StringIO(content)
>      setattr(data,'url',feed)
>      setattr(data,'headers',resp_headers)
>      feedparser.parse(data)
>
> Of course, if httplib2 takes care of unzipping and deflating, headers
> like content-encoding may need to be removed lest the feedparser tries
> to unzip the results.

Done.

Required a small change to httplib2 to make the Response
object conform to feedparser's expectations, and I undid the
changes to feedparser.

Still needs unit tests, and I believe this change causes the HTTP
status code to get eaten along the way.

> Another thing to care about in the case of redirects, is that the url
> property should be set to the value of the location header.
>
> Again, it seems to me that such logic may benefit other users of httplib2.

That still needs to be done.

   -joe

-- 
Joe Gregorio        http://bitworking.org
Parallelizing Fetches
2006-11-06 11:31:05
Joe Gregorio wrote:
> 
> I believe this change causes the HTTP
> status code to get eaten along the way.

The feedparser already made provisions for this:

     if hasattr(f, 'status'):
         result['status'] = f.status

- Sam Ruby
Parallelizing Fetches
2006-11-07 16:05:31
On 11/5/06, Sam Ruby <rubysintertwingly.net> wrote:
>      # iri support
>      try:
>          if isinstance(url,unicode):
>              url = url.encode('idna')
>          else:
>              url = url.decode('utf-8').encode('idna')
>      except:
>          pass

I still plan on adding in IRI support, but it appears the idna encoding
doesn't restrict itself to just the host name:

>>> a = u"http://bitworking.org/projects/httplib2/test/reflector/reflector.cgi/N"
>>> a.encode('idna')
'http://bitworking.org/projects/httplib2/test/reflector/reflector.xn--cgi/-eh3b'
>>>
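
One way to avoid this, sketched in current Python, is to split the URL first and apply the idna codec to the host alone. This is an illustration under assumptions, not what httplib2 does: iri_host_to_ascii is a hypothetical helper, it leaves IPv6 literals and undecodable hosts untouched, and it does not percent-encode non-ASCII characters in the path or query.

```python
from urllib.parse import urlsplit, urlunsplit


def iri_host_to_ascii(url):
    """Return url with only its host converted to its IDNA (ASCII) form."""
    parts = urlsplit(url)
    host = parts.hostname  # note: urlsplit lower-cases the host
    if host is None or ":" in host:
        return url  # no host, or an IPv6 literal: leave unchanged
    try:
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        return url  # not a valid IDNA host: leave unchanged
    # Rebuild the netloc, preserving any userinfo and port.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + ascii_host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    return urlunsplit(parts._replace(netloc=netloc))
```

Because the path never reaches the codec, a dotted path segment such as reflector.cgi is no longer treated as a host label.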

  -joe

-- 
Joe Gregorio        http://bitworking.org
Parallelizing Fetches
2006-11-07 18:25:48
On 11/5/06, Sam Ruby <rubysintertwingly.net> wrote:
> Another thing to care about in the case of redirects, is that the url
> property should be set to the value of the location header.
>
> Again, it seems to me that such logic may benefit other users of httplib2.

Done. Httplib2 now provides a '-location' header in all responses
that contains the URI that was ultimately requested. That version
of httplib2 is now in:

   http://bitworking.org/projects/venus/branches/threaded/

and that support has now been added so that
the '-location' URI is passed to feedparser.

Writing unit tests for the threaded support looks like it will
require using SimpleHTTPServer.

   -joe

-- 
Joe Gregorio        http://bitworking.org
Parallelizing Fetches
2006-11-14 19:32:19
Joe Gregorio wrote:
> On 11/5/06, Sam Ruby <rubysintertwingly.net> wrote:
>> Another thing to care about in the case of redirects, is that the url
>> property should be set to the value of the location header.
>>
>> Again, it seems to me that such logic may benefit other users of
>> httplib2.
> 
> Done. Httplib2 now provides a '-location' header in all responses
> that contains the URI that was ultimately requested. That version
> of httplib2 is now in:
> 
>   http://bitworking.org/projects/venus/branches/threaded/
> 
> and that support has now been added so that
> the '-location' URI is passed to feedparser.

I've merged in this branch.

> Writing unit tests for the threaded support looks like it will
> require using SimpleHTTPServer.

I've added a single test case using SimpleHTTPServer.  You can find it in:

http://intertwingly.net/code/venus/tests/test_spider.py
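
For readers on current Python, where SimpleHTTPServer lives on as http.server, the general shape of such a test might look like the sketch below. It is not the contents of test_spider.py: fetch_from_local_server and QuietHandler are hypothetical names, and the fetch uses http.client rather than the spider itself.

```python
import functools
import http.client
import http.server
import tempfile
import threading
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep test output free of per-request log lines


def fetch_from_local_server(filename, body):
    """Serve one file from a temp directory on localhost and fetch it back."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, filename).write_bytes(body)
        handler = functools.partial(QuietHandler, directory=root)
        # Port 0 asks the operating system for any free port.
        server = http.server.HTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            port = server.server_address[1]
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            try:
                conn.request("GET", "/" + filename)
                resp = conn.getresponse()
                return resp.status, resp.read()
            finally:
                conn.close()
        finally:
            server.shutdown()
            server.server_close()
            thread.join()
```

Running the server in a background thread keeps the test self-contained: no external network access is needed and the port is released when the function returns.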

In the process, I made a trivial patch to httplib2 to make it run on
Python 2.2.  The patch was the addition of one line:

from __future__ import generators

>   -joe

- Sam Ruby
