Thread: Re: Status of htdig support of crawling SSL sites?




Re: Status of htdig support of crawling SSL sites?
2007-03-29 19:39:03
On Thu, 29 Mar 2007, Joe Auty wrote:

> Date: Thu, 29 Mar 2007 17:05:55 -0400
> From: Joe Auty <jautyindiana.edu>
> To: Jim Cole <listsyggdrasill.net>
> Cc: htdig-generallists.sourceforge.net
> Subject: Re: [htdig] Status of htdig support of crawling SSL sites?
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Cool, that sounds exactly like what I've been experiencing.
> 
> Do you think somebody could send me a copy of the patch? I expect that
> some characters were garbled with my copy and paste of the list
> archive, as I'm getting the following (where "patch" is a file
> containing the paste of the patch included within this thread)

Here it is on the patch site:

ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________    _-<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        jjahcloud.ccsf.cc.ca.us

> 
> # patch -p0 <patch
> patching file htnet/SSLConnection.cc
> patch: **** malformed patch at line 4: {
> 
> 
> Thanks in advance!
> 
> 
> 
> Jim Cole wrote:
> > Hi - There is a bug with SSL handling in the 3.2.0b6 code. While a
> > patch has been applied to what is in CVS, it will not be in the
> > version you are working with unless the RPM producer added it on
> > their own. I suspect that it is this bug that is causing the
> > problems you report below. The reason it seems to take an
> > unexpectedly long time to generate the output you provided is due to
> > the fact that the connection is not functioning properly and has to
> > wait on a 30 second timeout three times before htdig gives up.
> >
> > Here are a couple links regarding the SSL bug, including a patch.
> >
> > http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.60.0408262141580.7827%40loki.yggdrasill.net
> >
> > http://sourceforge.net/mailarchive/forum.php?thread_name=Pine.LNX.4.63.0505230007240.26726%40loki.yggdrasill.net&forum_name=htdig-dev
> >
> >
> > Jim
> >
> >
> > On Mar 29, 2007, at 11:08 AM, Joe Auty wrote:
> >
> >> I've rebuilt htdig with --with-ssl, and it appears as if I have SSL
> >> support, but I'm having the following problem now... I'm wondering if
> >> htdig has difficulty with self-signed SSL certs? As it stands, it
> >> takes quite a while to produce the following output:
> >>
> >> (I've replaced the real URL of my site with mysite.com)
> >>
> >>
> >>
> >> # ./htdig -vvvvv
> >> ht://dig Start Time: Thu Mar 29 12:34:34 2007
> >>         1:1:https://mysite.com/assetdb
> >> New server: mysite.com, 443
> >>  - Persistent connections: enabled
> >>  - HEAD before GET: enabled
> >>  - Timeout: 30
> >>  - Connection space: 0
> >>  - Max Documents: -1
> >>  - TCP retries: 1
> >>  - TCP wait time: 5
> >>  - Accept-Language:
> >> Trying to retrieve robots.txt file
> >> Creating an HtHTTPSecure object
> >> Making HTTPS request on https://mysite.com/robots.txt
> >>   Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >>     1 - Open of the connection ok
> >>         Assigned the remote host mysite.com
> >>         Assigned the port 443
> >>     1 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >>   Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >>     2 - Open of the connection ok
> >>         Assigned the remote host mysite.com
> >>         Assigned the port 443
> >>     2 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >> .  Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >>     3 - Open of the connection ok
> >>         Assigned the remote host mysite.com
> >>         Assigned the port 443
> >>     3 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >> . pushed
> >> pick: mysite.com, # servers = 1
> >> > mysite.com supports HTTP persistent connections (infinite)
> >> ht://dig End Time: Thu Mar 29 12:36:04 2007
> >
> 
> 
> - --
> Joe Auty
> UITS Messaging
> Indiana University
> jautyindiana.edu
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

> 
>
iD8DBQFGDCozIGoilq3QRWsRAkQXAKDNvT2B43wn4rI3Qg62p+ET5RO+UgCf
eD5e
> dGib3fpIJ3LYc+7B+ZrRxYw=
> =mFVu
> -----END PGP SIGNATURE-----
> 
> 
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> ht://Dig general mailing list: <htdig-generallists.sourceforge.net>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general


Re: Status of htdig support of crawling SSL sites?
2007-03-30 11:51:27
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joe R. Jah wrote:
> Here it is on the patch site:
>
> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>
> Regards,
>
> Joe


Hmmm... After running the patch on the source and recompiling from
source, I'm still getting the following... Any ideas?




]# /opt/www/bin/rundig -vvvvv
ht://dig Start Time: Fri Mar 30 12:48:49 2007
        1:1:https://mysecuresite.com/assetdb
New server: mysecuresite.com, 443
 - Persistent connections: enabled
 - HEAD before GET: enabled
 - Timeout: 30
 - Connection space: 0
 - Max Documents: -1
 - TCP retries: 1
 - TCP wait time: 5
 - Accept-Language:
Trying to retrieve robots.txt file
Creating an HtHTTPSecure object
Making HTTPS request on https://mysecuresite.com/robots.txt
  Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
    1 - Open of the connection ok
        Assigned the remote host mysecuresite.com
        Assigned the port 443
    1 - Connection fell down ... let's close it
Request time: 30 secs
  Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
    2 - Open of the connection ok
        Assigned the remote host mysecuresite.com
        Assigned the port 443
    2 - Connection fell down ... let's close it
Request time: 30 secs
.  Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
    3 - Open of the connection ok
        Assigned the remote host mysecuresite.com
        Assigned the port 443
    3 - Connection fell down ... let's close it
Request time: 30 secs
. pushed
pick: mysecuresite.com, # servers = 1
> mysecuresite.com supports HTTP persistent connections (infinite)
ht://dig End Time: Fri Mar 30 12:50:19 2007
htpurge: Database is empty!

Preamble text:


Postamble text:
Note: This message will be sent again if you do not change or
take away the notification of the above mentioned HTML page.

Find out more about the notification service at

    http://www.htdig.org/meta.html

Cheers!

ht://Dig Notification Service





- --
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org


iD8DBQFGDUAPIGoilq3QRWsRAsp8AKDvMrmCJRynVBcEquaRJlYb92X43wCg
ufNS
N4f7fAaHWCIuWU1Ssrd/fOk=
=gWmF
-----END PGP SIGNATURE-----


Re: Status of htdig support of crawling SSL sites?
2007-04-02 15:02:17
On Mar 30, 2007, at 10:51 AM, Joe Auty wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Joe R. Jah wrote:
>> Here it is on the patch site:
>>
>> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>>
>> Regards,
>>
>> Joe
>
>
> Hmmm... After running the patch on the source and recompiling from
> source, I'm still getting the following... Any ideas?

If you haven't already done so, I would suggest you visually verify
that the patch applied correctly; the key is the addition of the
SSL_pending line in htnet/SSLConnection.cc. I would also double check
that the new htdig executable was rebuilt and installed to the
correct location. Is there any chance that another copy of htdig is
installed somewhere on the system? Perhaps from a previous build, RPM
install, etc? If so, make sure there is no chance that it is being
executed in place of your patched version.

The error you are seeing is occurring when htdig tries to read the
header returned from its first request. Its attempt to read from the
connection times out waiting for data. This ultimately occurs in
Read_Partial() of SSLConnection, which is the method that the patch
(theoretically) fixes. The fact that you are still seeing the same
errors here indicates that you might not really be running a properly
patched version of htdig. The other possibility is that there is
still a problem in this code that the patch does not fully address.
If this is the case, I don't know what else to suggest at this point.
If the site is publicly accessible, I could try reproducing the
problem and taking a closer look at what is happening at the
connection level.
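
For readers following along, here is a minimal sketch of one way a
read can time out even though data has arrived, which is the kind of
case an SSL_pending check guards against. It is not htdig code and
makes no claim about what Read_Partial() actually does. BufferedReader,
wait_readable, data_available and run_demo are hypothetical names, and
a plain local socket pair with a simple in-memory buffer stands in for
the SSL layer. The assumption is that the reader, like an SSL library,
may take more bytes off the socket than the caller asked for, so a
later select() on the socket sees nothing even though bytes are still
waiting in the buffer.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an SSL connection (not htdig code). Like an SSL
// library, it may pull more bytes off the socket than the caller asked for
// and keep the remainder in an internal buffer.
class BufferedReader {
public:
    explicit BufferedReader(int fd) : fd_(fd) {}

    // Bytes already taken from the socket but not yet handed to the caller.
    // This plays the role SSL_pending() plays for a real SSL object.
    std::size_t pending() const { return buf_.size(); }

    // Return up to n bytes, refilling from the socket only when empty.
    std::string read(std::size_t n) {
        if (buf_.empty()) {
            char tmp[4096];
            ssize_t got = ::read(fd_, tmp, sizeof tmp);
            if (got > 0) buf_.assign(tmp, static_cast<std::size_t>(got));
        }
        std::string out = buf_.substr(0, n);
        buf_.erase(0, out.size());
        return out;
    }

private:
    int fd_;
    std::string buf_;
};

// Wait until fd is readable or the timeout expires; true means readable.
bool wait_readable(int fd, long timeout_usec) {
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    timeval tv;
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;
    return select(fd + 1, &fds, nullptr, nullptr, &tv) > 0;
}

// Patched-style check: consult the buffer first, select() only if it is empty.
bool data_available(const BufferedReader& r, int fd, long timeout_usec) {
    return r.pending() > 0 || wait_readable(fd, timeout_usec);
}

struct DemoResult {
    bool select_only;    // what a bare select() on the socket reports
    bool pending_first;  // what the pending-then-select check reports
    std::string rest;    // bytes that were in fact still readable
};

DemoResult run_demo() {
    int sv[2];
    DemoResult d{false, false, ""};
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return d;

    const char msg[] = "HTTP/1.1 200 OK\r\n";
    if (write(sv[1], msg, sizeof msg - 1) < 0) {
        close(sv[0]);
        close(sv[1]);
        return d;
    }

    BufferedReader r(sv[0]);
    r.read(4);  // consumes "HTTP"; the remaining bytes now sit in the buffer

    d.select_only = wait_readable(sv[0], 100000);        // socket is drained
    d.pending_first = data_available(r, sv[0], 100000);  // buffer is not
    d.rest = r.read(64);

    close(sv[0]);
    close(sv[1]);
    return d;
}
```

Calling run_demo() from a main() shows the bare select() timing out
while the pending-first check reports data straight away. With a real
30 second timeout, the bare select() path would look like the
"Connection fell down" lines in the log even though the response had
already arrived.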

Jim


Re: Status of htdig support of crawling SSL sites?
2007-04-03 13:46:30
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jim Cole wrote:
> On Mar 30, 2007, at 10:51 AM, Joe Auty wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>
>> Joe R. Jah wrote:
>>> Here it is on the patch site:
>>>
>>> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>>>
>>> Regards,
>>>
>>> Joe
>>
>>
>> Hmmm... After running the patch on the source and recompiling
>> from source, I'm still getting the following... Any ideas?
>
> If you haven't already done so, I would suggest you visually verify
> that the patch applied correctly; the key is the addition of the
> SSL_pending line in htnet/SSLConnection.cc. I would also double
> check that the new htdig executable was rebuilt and installed to
> the correct location. Is there any chance that another copy of
> htdig is installed somewhere on the system? Perhaps from a previous
> build, RPM install, etc? If so, make sure there is no chance that
> it is being executed in place of your patched version.
>

Yes, the patch has been applied:

$ diff SSLConnection.cc orig_SSLConnection.cc
134,147c134,145
<       if (!SSL_pending(ssl)) {
<         if (timeout_value > 0) {
<             FD_SET_T fds;
<             FD_ZERO(&fds);
<             FD_SET(sock, &fds);
<
<             timeval tv;
<             tv.tv_sec = timeout_value;
<             tv.tv_usec = 0;
<
<             int selected = select(sock+1, &fds, 0, 0, &tv);
<             if (selected <= 0)
<                 need_io_stop++;
<         }
---
>       if (timeout_value > 0) {
>           FD_SET_T fds;
>           FD_ZERO(&fds);
>           FD_SET(sock, &fds);
>
>           timeval tv;
>           tv.tv_sec = timeout_value;
>           tv.tv_usec = 0;
>
>           int selected = select(sock+1, &fds, 0, 0, &tv);
>           if (selected <= 0)
>               need_io_stop++;


Could this problem be with my using a self-signed certificate?



> The error you are seeing is occurring when htdig tries to read the
> header returned from its first request. Its attempt to read from
> the connection times out waiting for data. This ultimately occurs
> in Read_Partial() of SSLConnection, which is the method that the
> patch (theoretically) fixes. The fact that you are still seeing the
> same errors here indicates that you might not really be running a
> properly patched version of htdig. The other possibility is that
> there is still a problem in this code that the patch does not fully
> address. If this is the case, I don't know what else to suggest at
> this point. If the site is publicly accessible, I could try
> reproducing the problem and taking a closer look at what is
> happening at the connection level.
>
> Jim
>


- --
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org


iD8DBQFGEqEGIGoilq3QRWsRAk11AJ9c5dfHFpvQqLbCl6VZIN6Dx6UK7ACg
pZCF
uQkJTCWc4aFSW/GhN7wzsx4=
=nv49
-----END PGP SIGNATURE-----


Re: Status of htdig support of crawling SSL sites?
2007-04-03 16:19:19
On Apr 3, 2007, at 12:46 PM, Joe Auty wrote:

> Could this problem be with my using a self-signed certificate?

To the best of my knowledge a self-signed certificate shouldn't be a
problem. I don't recall anyone ever reporting that as an issue, and I
do seem to recall at least one person stating that they tested with a
self-signed certificate (don't remember what version of ht://Dig).
Sorry that I can't provide a more certain answer.

If you have access to the server logs you might want to check those.
In short what appears to be happening is that htdig makes a request
for robots.txt, goes to the connection to look for a response, and
times out before it sees one. It tries waiting on the response three
times and then bails. The most obvious causes are that the connection
is going away for some reason, the server is not sending a response,
or htdig is for some reason unable to see the response. The patch
corrected one case where the latter could occur.
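
The three waits account for the run time in both logs earlier in the
thread: Start Time and End Time are 90 seconds apart (12:34:34 to
12:36:04, and 12:48:49 to 12:50:19), which is three 30 second
timeouts. A rough sketch of that wait-and-retry pattern follows. It
is not taken from the htdig source. Outcome, wait_with_retries and
demo are hypothetical names, and a local socket pair stands in for
the HTTPS connection.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#include <cassert>
#include <chrono>

struct Outcome {
    int attempts;          // how many waits were started
    bool got_response;     // true if the descriptor became readable
    long long elapsed_ms;  // wall-clock time spent waiting
};

// Wait for fd to become readable, giving up after max_attempts timeouts.
Outcome wait_with_retries(int fd, int max_attempts, long timeout_ms) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    Outcome out{0, false, 0};

    while (out.attempts < max_attempts && !out.got_response) {
        ++out.attempts;
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        timeval tv;
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        out.got_response = select(fd + 1, &fds, nullptr, nullptr, &tv) > 0;
    }

    out.elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                         Clock::now() - start).count();
    return out;
}

// Run the wait against a local socket pair. The peer either stays silent
// or sends one byte before the wait starts.
Outcome demo(bool peer_responds, int max_attempts, long timeout_ms) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return Outcome{0, false, 0};

    bool sent_ok = true;
    if (peer_responds) sent_ok = write(sv[1], "x", 1) == 1;

    Outcome out{0, false, 0};
    if (sent_ok) out = wait_with_retries(sv[0], max_attempts, timeout_ms);

    close(sv[0]);
    close(sv[1]);
    return out;
}
```

With a silent peer, demo(false, 3, 50) makes all three attempts and
takes roughly three timeouts in total. With a peer that has already
sent a byte, demo(true, 3, 50) returns on the first attempt. Scaled
up to a 30 second timeout, the silent case is the 90 second run seen
in the logs.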

Is there any possibility that a firewall might be blocking the
response? Have you tried accessing the site with a regular browser
from the machine running htdig?

Jim
