
Thread: Status of htdig support of crawling SSL sites?




Status of htdig support of crawling SSL sites?
2007-03-29 10:59:31

Hello,

I was unable to find this info in a Google search since it seems that
the Sourceforge mailing list archive for this list is a little messed
up...

What is needed for htdig to be able to crawl SSL sites? It seems that
the Redhat RPM does not include support for this out of the box, so
I'm wondering if there are any build instructions out there for the
latest version of htdig?

--
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu


_______________________________________________
ht://Dig general mailing list: <htdig-generallists.sourceforge.net>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general

Re: Status of htdig support of crawling SSL sites?
2007-03-29 12:08:47

Joe Auty wrote:
> Hello,
>
> I was unable to find this info in a Google search since it seems
> that the Sourceforge mailing list archive for this list is a little
> messed up...
>
> What is needed for htdig to be able to crawl SSL sites? It seems
> that the Redhat RPM does not include support for this out of the
> box, so I'm wondering if there are any build instructions out there
> for the latest version of htdig?
>

I've rebuilt htdig with --with-ssl, and it appears as if I have SSL
support, but I'm having the following problem now... I'm wondering if
htdig has difficulty with self-signed SSL certs? As it stands, it
takes quite a while to produce the following output:

(I've replaced the real URL of my site with mysite.com)



# ./htdig -vvvvv
ht://dig Start Time: Thu Mar 29 12:34:34 2007
        1:1:https://mysite.com/assetdb

New server: mysite.com, 443
 - Persistent connections: enabled
 - HEAD before GET: enabled
 - Timeout: 30
 - Connection space: 0
 - Max Documents: -1
 - TCP retries: 1
 - TCP wait time: 5
 - Accept-Language:
Trying to retrieve robots.txt file
Creating an HtHTTPSecure object
Making HTTPS request on https://mysite.com/robots.txt
  Making a HEAD call before the GET
Try to get through to host mysite.com (port 443)
    1 - Open of the connection ok
        Assigned the remote host mysite.com
        Assigned the port 443
    1 - Connection fell down ... let's close it
Request time: 30 secs
  Making a HEAD call before the GET
Try to get through to host mysite.com (port 443)
    2 - Open of the connection ok
        Assigned the remote host mysite.com
        Assigned the port 443
    2 - Connection fell down ... let's close it
Request time: 30 secs
.  Making a HEAD call before the GET
Try to get through to host mysite.com (port 443)
    3 - Open of the connection ok
        Assigned the remote host mysite.com
        Assigned the port 443
    3 - Connection fell down ... let's close it
Request time: 30 secs
. pushed
pick: mysite.com, # servers = 1
> mysite.com supports HTTP persistent connections (infinite)
ht://dig End Time: Thu Mar 29 12:36:04 2007



--
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu




Re: Status of htdig support of crawling SSL sites?
2007-03-29 13:18:43
On Thu, Mar 29, 2007 at 01:08:47PM -0400, Joe Auty wrote:

> I've rebuilt htdig with --with-ssl, and it appears as if I have SSL
> support, but I'm having the following problem now... I'm wondering if
> htdig has difficulty with self-signed SSL certs?

IIRC, yes.
Some time ago, I ran into the same problem. I solved it by allowing http
only for the server running htdig, and then I have in htdig.conf:

search_rewrite_rules: http://www.example.org/(.*) https://www.example.org/\1

So all links are https://
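
To illustrate what such a rule does to a stored URL, here is a minimal sketch in Python. It only mimics the pattern/replacement pair with Python's own `re` module; it is not htdig code, htdig's regex flavour may differ in details, and the `rewrite` helper and sample URLs are hypothetical.

```python
import re

# The pattern/replacement pair from the rule above, written for Python's re module.
# (Illustrative only: htdig applies its own regex engine, not Python's.)
PATTERN = r"http://www.example.org/(.*)"
REPLACEMENT = r"https://www.example.org/\1"

def rewrite(url: str) -> str:
    """Rewrite an http:// link on www.example.org to https://, keeping the path."""
    return re.sub(PATTERN, REPLACEMENT, url, count=1)

# A crawled http:// URL is turned into its https:// equivalent.
print(rewrite("http://www.example.org/docs/index.html"))
# URLs that do not match the pattern are returned unchanged.
print(rewrite("http://other.example.com/page.html"))
```

The idea is that the crawl happens over plain http, while every link shown in search results points at the https site.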

Rainer


Re: Status of htdig support of crawling SSL sites?
2007-03-29 14:26:27

Rainer Sokoll wrote:
> On Thu, Mar 29, 2007 at 01:08:47PM -0400, Joe Auty wrote:
>
>> I've rebuilt htdig with --with-ssl, and it appears as if I have
>> SSL support, but I'm having the following problem now... I'm
>> wondering if htdig has difficulty with self-signed SSL certs?
>
> IIRC, yes. Some time ago, I ran into the same problem. I solved it
> by allowing http only for the server running htdig, and then I have
> in htdig.conf:
>
> search_rewrite_rules: http://www.example.org/(.*) https://www.example.org/\1
>
> So all links are https://
>
> Rainer
>

Hmmm... Just so I know what my options are here, is adding the
self-signed cert to my CA file an option that would work with htdig?



--
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu



Re: Status of htdig support of crawling SSL sites?
2007-03-29 15:07:14
Hi - There is a bug with SSL handling in the 3.2.0b6 code. While a
patch has been applied to what is in CVS, it will not be in the
version you are working with unless the RPM producer added it on
their own. I suspect that it is this bug that is causing the problems
you reported. It takes an unexpectedly long time to generate the
output you provided because the connection is not functioning
properly and has to wait on a 30-second timeout three times before
htdig gives up.
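
The start and end times in the posted log are consistent with this: three attempts at 30 seconds each account for the whole run. A quick check, sketched in Python (the helper name is made up; the timestamps and the attempt count are taken from the log earlier in the thread):

```python
from datetime import datetime

# Timestamp layout used by the "Start Time" / "End Time" lines in the htdig log.
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two htdig log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()

# From the log: three attempts, each reporting "Request time: 30 secs".
attempts = 3
timeout_secs = 30

run_time = elapsed_seconds("Thu Mar 29 12:34:34 2007", "Thu Mar 29 12:36:04 2007")
print(run_time, attempts * timeout_secs)  # 90.0 90
```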

Here are a couple links regarding the SSL bug, including a patch.

   http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.60.0408262141580.7827%40loki.yggdrasill.net
   http://sourceforge.net/mailarchive/forum.php?thread_name=Pine.LNX.4.63.0505230007240.26726%40loki.yggdrasill.net&forum_name=htdig-dev

Jim





Re: Status of htdig support of crawling SSL sites?
2007-03-29 15:16:10
On Thu, Mar 29, 2007 at 03:26:27PM -0400, Joe Auty wrote:

> Hmmm... Just so I know what my options are here, is adding the
> self-signed cert to my CA file an option that would work with htdig?

Dunno 
Try it out and report your results here, please.

Rainer


Re: Status of htdig support of crawling SSL sites?
2007-03-29 16:05:55

Cool, that sounds exactly like what I've been experiencing.

Do you think somebody could send me a copy of the patch? I expect that
some characters were garbled with my copy and paste of the list
archive, as I'm getting the following (where "patch" is a file
containing the paste of the patch included within this thread)


# patch -p0 <patch
patching file htnet/SSLConnection.cc
patch: **** malformed patch at line 4: {


Thanks in advance!



Jim Cole wrote:
> Hi - There is a bug with SSL handling in the 3.2.0b6 code. While a
> patch has been applied to what is in CVS, it will not be in the
> version you are working with unless the RPM producer added it on
> their own. I suspect that it is this bug that is causing the
> problems you report below. The reason it seems to take an
> unexpectedly long time to generate the output you provided is due to
> the fact that the connection is not functioning properly and has to
> wait on a 30 second timeout three times before htdig gives up.
>
> Here are a couple links regarding the SSL bug, including a patch.
>
> http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.60.0408262141580.7827%40loki.yggdrasill.net
>
> http://sourceforge.net/mailarchive/forum.php?thread_name=Pine.LNX.4.63.0505230007240.26726%40loki.yggdrasill.net&forum_name=htdig-dev
>
> Jim


--
Joe Auty
UITS Messaging
Indiana University
jautyindiana.edu


