Thread: Re: Status of htdig support of crawling SSL sites?

Re: Status of htdig support of crawling SSL sites? | United States | 2007-03-29 19:39:03

On Thu, 29 Mar 2007, Joe Auty wrote:
> Date: Thu, 29 Mar 2007 17:05:55 -0400
> From: Joe Auty <jauty indiana.edu>
> To: Jim Cole <lists yggdrasill.net>
> Cc: htdig-general lists.sourceforge.net
> Subject: Re: [htdig] Status of htdig support of crawling SSL sites?
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Cool, that sounds exactly like what I've been experiencing.
>
> Do you think somebody could send me a copy of the patch? I expect that
> some characters were garbled with my copy and paste of the list
> archive, as I'm getting the following (where "patch" is a file
> containing the paste of the patch included within this thread)
Here it is on the patch site:
ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
Regards,
Joe
--
Joe R. Jah
jjah cloud.ccsf.cc.ca.us
>
> # patch -p0 <patch
> patching file htnet/SSLConnection.cc
> patch: **** malformed patch at line 4: {
>
>
> Thanks in advance!
>
>
>
> Jim Cole wrote:
> > Hi - There is a bug with SSL handling in the 3.2.0b6 code. While a
> > patch has been applied to what is in CVS, it will not be in the
> > version you are working with unless the RPM producer added it on
> > their own. I suspect that it is this bug that is causing the
> > problems you report below. The reason it seems to take an
> > unexpectedly long time to generate the output you provided is due to
> > the fact that the connection is not functioning properly and has to
> > wait on a 30 second timeout three times before htdig gives up.
> >
> > Here are a couple links regarding the SSL bug, including a patch.
> >
> >
> > http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.60.0408262141580.7827%40loki.yggdrasill.net
> >
> >
> > http://sourceforge.net/mailarchive/forum.php?thread_name=Pine.LNX.4.63.0505230007240.26726%40loki.yggdrasill.net&forum_name=htdig-dev
> >
> >
> > Jim
> >
> >
> > On Mar 29, 2007, at 11:08 AM, Joe Auty wrote:
> >
> >> I've rebuilt htdig with --with-ssl, and it appears as if I have SSL
> >> support, but I'm having the following problem now... I'm wondering if
> >> htdig has difficulty with self-signed SSL certs? As it stands, it
> >> takes quite a while to produce the following output:
> >>
> >> (I've replaced the real URL of my site with mysite.com)
> >>
> >>
> >>
> >> # ./htdig -vvvvv
> >> ht://dig Start Time: Thu Mar 29 12:34:34 2007
> >> 1:1:https://mysite.com/assetdb
> >> New server: mysite.com, 443
> >> - Persistent connections: enabled
> >> - HEAD before GET: enabled
> >> - Timeout: 30
> >> - Connection space: 0
> >> - Max Documents: -1
> >> - TCP retries: 1
> >> - TCP wait time: 5
> >> - Accept-Language:
> >> Trying to retrieve robots.txt file
> >> Creating an HtHTTPSecure object
> >> Making HTTPS request on https://mysite.com/robots.txt
> >> Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >> 1 - Open of the connection ok
> >> Assigned the remote host mysite.com
> >> Assigned the port 443
> >> 1 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >> Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >> 2 - Open of the connection ok
> >> Assigned the remote host mysite.com
> >> Assigned the port 443
> >> 2 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >> . Making a HEAD call before the GET
> >> Try to get through to host mysite.com (port 443)
> >> 3 - Open of the connection ok
> >> Assigned the remote host mysite.com
> >> Assigned the port 443
> >> 3 - Connection fell down ... let's close it
> >> Request time: 30 secs
> >> . pushed
> >> pick: mysite.com, # servers = 1
> >>> mysite.com supports HTTP persistent connections (infinite)
> >> ht://dig End Time: Thu Mar 29 12:36:04 2007
> >
>
>
> - --
> Joe Auty
> UITS Messaging
> Indiana University
> jauty indiana.edu
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFGDCozIGoilq3QRWsRAkQXAKDNvT2B43wn4rI3Qg62p+ET5RO+UgCfeD5e
> dGib3fpIJ3LYc+7B+ZrRxYw=
> =mFVu
> -----END PGP SIGNATURE-----
>
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> ht://Dig general mailing list: <htdig-general lists.sourceforge.net>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general

Re: Status of htdig support of crawling SSL sites? | United States | 2007-03-30 11:51:27

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Joe R. Jah wrote:
> Here it is on the patch site:
>
> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>
> Regards,
>
> Joe
Hmmm... After running the patch on the source and recompiling from
source, I'm still getting the following... Any ideas?
]# /opt/www/bin/rundig -vvvvv
ht://dig Start Time: Fri Mar 30 12:48:49 2007
1:1:https://mysecuresite.com/assetdb
New server: mysecuresite.com, 443
- Persistent connections: enabled
- HEAD before GET: enabled
- Timeout: 30
- Connection space: 0
- Max Documents: -1
- TCP retries: 1
- TCP wait time: 5
- Accept-Language:
Trying to retrieve robots.txt file
Creating an HtHTTPSecure object
Making HTTPS request on https://mysecuresite.com/robots.txt
Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
1 - Open of the connection ok
Assigned the remote host mysecuresite.com
Assigned the port 443
1 - Connection fell down ... let's close it
Request time: 30 secs
Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
2 - Open of the connection ok
Assigned the remote host mysecuresite.com
Assigned the port 443
2 - Connection fell down ... let's close it
Request time: 30 secs
. Making a HEAD call before the GET
Try to get through to host mysecuresite.com (port 443)
3 - Open of the connection ok
Assigned the remote host mysecuresite.com
Assigned the port 443
3 - Connection fell down ... let's close it
Request time: 30 secs
. pushed
pick: mysecuresite.com, # servers = 1
> mysecuresite.com supports HTTP persistent connections (infinite)
ht://dig End Time: Fri Mar 30 12:50:19 2007
htpurge: Database is empty!
Preamble text:
Postamble text:
Note: This message will be sent again if you do not change or
take away the notification of the above mentioned HTML page.
Find out more about the notification service at
http://www.htdig.org/meta.html
Cheers!
ht://Dig Notification Service
- --
Joe Auty
UITS Messaging
Indiana University
jauty indiana.edu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGDUAPIGoilq3QRWsRAsp8AKDvMrmCJRynVBcEquaRJlYb92X43wCgufNS
N4f7fAaHWCIuWU1Ssrd/fOk=
=gWmF
-----END PGP SIGNATURE-----

Re: Status of htdig support of crawling SSL sites? | United States | 2007-04-02 15:02:17

On Mar 30, 2007, at 10:51 AM, Joe Auty wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Joe R. Jah wrote:
>> Here it is on the patch site:
>>
>> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>>
>> Regards,
>>
>> Joe
>
>
> Hmmm... After running the patch on the source and recompiling from
> source, I'm still getting the following... Any ideas?

If you haven't already done so, I would suggest you visually verify
that the patch applied correctly; the key is the addition of the
SSL_pending line in htnet/SSLConnection.cc. I would also double check
that the new htdig executable was rebuilt and installed to the
correct location. Is there any chance that another copy of htdig is
installed somewhere on the system? Perhaps from a previous build, RPM
install, etc? If so, make sure there is no chance that it is being
executed in place of your patched version.

The error you are seeing is occurring when htdig tries to read the
header returned from its first request. Its attempt to read from the
connection times out waiting for data. This ultimately occurs in
Read_Partial() of SSLConnection, which is the method that the patch
(theoretically) fixes. The fact that you are still seeing the same
errors here indicates that you might not really be running a properly
patched version of htdig. The other possibility is that there is
still a problem in this code that the patch does not fully address.
If this is the case, I don't know what else to suggest at this point.
If the site is publicly accessible, I could try reproducing the
problem and taking a closer look at what is happening at the
connection level.

Jim
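
One plausible reading of the failure Jim describes is that the bytes htdig is waiting for have already been pulled off the socket and are sitting in the SSL layer's own buffer, so a select() on the raw descriptor never reports them; checking SSL_pending() first avoids that wait. The following is a minimal sketch of that idea only, not ht://Dig code. A pipe stands in for the TLS socket, and BufferedConn, pending(), wait_readable() and remainder_detected() are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical stand-in for an SSL connection: it reads ahead from the
// descriptor into a user-space buffer, much as an SSL library may hold
// decrypted bytes that the kernel no longer knows about.
class BufferedConn {
public:
    explicit BufferedConn(int fd) : fd_(fd) { assert(fd >= 0); }

    // Bytes already read from the descriptor but not yet consumed
    // (the role SSL_pending() plays in the patch).
    std::size_t pending() const { return buf_.size(); }

    // Returns true if a read could make progress within timeout_secs.
    // With check_pending_first == false this mimics the unpatched code,
    // which only asks select() about the raw descriptor.
    bool wait_readable(int timeout_secs, bool check_pending_first) const {
        if (check_pending_first && pending() > 0)
            return true;  // buffered data: no need to wait on the socket
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(fd_, &fds);
        timeval tv;
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        return select(fd_ + 1, &fds, nullptr, nullptr, &tv) > 0;
    }

    // Hand back at most `want` bytes, refilling the buffer from the
    // descriptor only when the buffer is empty.
    std::string read_some(std::size_t want) {
        if (buf_.empty()) {
            char tmp[4096];
            ssize_t n = ::read(fd_, tmp, sizeof tmp);
            if (n > 0)
                buf_.assign(tmp, static_cast<std::size_t>(n));
        }
        std::string out = buf_.substr(0, want);
        buf_.erase(0, out.size());
        return out;
    }

private:
    int fd_;
    std::string buf_;
};

// Writes one short response into a pipe, consumes its first four bytes,
// then asks whether the remainder is seen as readable.
bool remainder_detected(bool check_pending_first) {
    int p[2];
    if (pipe(p) != 0)
        return false;
    const char msg[] = "HTTP/1.1 200 OK\r\n";
    ssize_t written = write(p[1], msg, sizeof msg - 1);
    (void)written;
    BufferedConn conn(p[0]);
    conn.read_some(4);  // drains the whole pipe into the buffer
    bool ready = conn.wait_readable(1, check_pending_first);
    close(p[0]);
    close(p[1]);
    return ready;
}
```

In this sketch remainder_detected(false) waits out the one-second timeout and reports nothing readable even though 13 bytes are still buffered, while remainder_detected(true) returns immediately. If the patched htdig still times out, the data is probably not reaching the SSL layer at all, which points back at the connection or the server rather than at this check.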

Re: Status of htdig support of crawling SSL sites? | United States | 2007-04-03 13:46:30

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jim Cole wrote:
> On Mar 30, 2007, at 10:51 AM, Joe Auty wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>
>> Joe R. Jah wrote:
>>> Here it is on the patch site:
>>>
>>> ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/SSL_pending.0
>>>
>>> Regards,
>>>
>>> Joe
>>
>>
>> Hmmm... After running the patch on the source and recompiling
>> from source, I'm still getting the following... Any ideas?
>
> If you haven't already done so, I would suggest you visually verify
> that the patch applied correctly; the key is the addition of the
> SSL_pending line in htnet/SSLConnection.cc. I would also double
> check that the new htdig executable was rebuilt and installed to
> the correct location. Is there any chance that another copy of
> htdig is installed somewhere on the system? Perhaps from a previous
> build, RPM install, etc? If so, make sure there is no chance that
> it is being executed in place of your patched version.
>
Yes, the patch has been applied:
$ diff SSLConnection.cc orig_SSLConnection.cc
134,147c134,145
< if (!SSL_pending(ssl)) {
< if (timeout_value > 0) {
< FD_SET_T fds;
< FD_ZERO(&fds);
< FD_SET(sock, &fds);
<
< timeval tv;
< tv.tv_sec = timeout_value;
< tv.tv_usec = 0;
<
< int selected = select(sock+1, &fds, 0, 0, &tv);
< if (selected <= 0)
< need_io_stop++;
< }
---
> if (timeout_value > 0) {
> FD_SET_T fds;
> FD_ZERO(&fds);
> FD_SET(sock, &fds);
>
> timeval tv;
> tv.tv_sec = timeout_value;
> tv.tv_usec = 0;
>
> int selected = select(sock+1, &fds, 0, 0, &tv);
> if (selected <= 0)
> need_io_stop++;

Could this problem be with my using a self-signed certificate?

> The error you are seeing is occurring when htdig tries to read the
> header returned from its first request. Its attempt to read from
> the connection times out waiting for data. This ultimately occurs
> in Read_Partial() of SSLConnection, which is the method that the
> patch (theoretically) fixes. The fact that you are still seeing the
> same errors here indicates that you might not really be running a
> properly patched version of htdig. The other possibility is that
> there is still a problem in this code that the patch does not fully
> address. If this is the case, I don't know what else to suggest at
> this point. If the site is publicly accessible, I could try
> reproducing the problem and taking a closer look at what is
> happening at the connection level.
>
> Jim
>
- --
Joe Auty
UITS Messaging
Indiana University
jauty indiana.edu
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGEqEGIGoilq3QRWsRAk11AJ9c5dfHFpvQqLbCl6VZIN6Dx6UK7ACgpZCF
uQkJTCWc4aFSW/GhN7wzsx4=
=nv49
-----END PGP SIGNATURE-----

Re: Status of htdig support of crawling SSL sites? | United States | 2007-04-03 16:19:19

On Apr 3, 2007, at 12:46 PM, Joe Auty wrote:
> Could this problem be with my using a self-signed certificate?

To the best of my knowledge a self-signed certificate shouldn't be a
problem. I don't recall anyone ever reporting that as an issue, and I
do seem to recall at least one person stating that they tested with a
self-signed certificate (don't remember what version of ht://Dig).
Sorry that I can't provide a more certain answer.

If you have access to the server logs you might want to check those.
In short what appears to be happening is that htdig makes a request
for robots.txt, goes to the connection to look for a response, and
times out before it sees one. It tries waiting on the response three
times and then bails. The most obvious causes are that the connection
is going away for some reason, the server is not sending a response,
or htdig is for some reason unable to see the response. The patch
corrected one case where the last of these could occur.

Is there any possibility that a firewall might be blocking the
response? Have you tried accessing the site with a regular browser
from the machine running htdig?

Jim
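
The timing in the logs earlier in the thread is consistent with this description: both runs span 90 seconds (12:34:34 to 12:36:04, and 12:48:49 to 12:50:19), which matches three attempts that each wait out the full 30-second Timeout. A minimal sketch of that arithmetic follows. RetryResult and request_with_retries are hypothetical names used only for illustration and are not taken from the ht://Dig source; the attempt itself is passed in as a callback so the sketch stays self-contained.

```cpp
#include <cassert>
#include <functional>

// Outcome of a retried request (hypothetical type, not from ht://Dig).
struct RetryResult {
    bool ok;          // true if some attempt succeeded
    int attempts;     // number of attempts made
    int waited_secs;  // seconds spent on attempts that timed out
};

// Try a request up to max_attempts times. Assumption: every failed
// attempt waits out the full timeout, as the "Request time: 30 secs"
// lines in the logs suggest.
RetryResult request_with_retries(const std::function<bool()>& attempt_once,
                                 int max_attempts, int timeout_secs) {
    assert(max_attempts >= 0 && timeout_secs >= 0);
    RetryResult result{false, 0, 0};
    for (int i = 0; i < max_attempts; ++i) {
        ++result.attempts;
        if (attempt_once()) {
            result.ok = true;
            return result;
        }
        result.waited_secs += timeout_secs;  // this attempt timed out
    }
    return result;
}
```

Calling request_with_retries with a callback that always fails, three attempts and a 30-second timeout gives waited_secs == 90, the same 90 seconds seen between the start and end times of each run.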