
Thread: GSA Network Diagnostics shows HTTP 505 versus Amazon S3?




GSA Network Diagnostics shows HTTP 505 versus Amazon S3?
2007-09-27 00:11:58
I'm trying to set up our enterprise Google Search Appliance (GSA) to
index documents hosted on Amazon S3 (Simple Storage Service).

While investigating a problem I'm having with Scheduled Crawls (they
run on schedule, but take exactly 30 minutes and index no documents) I
thought that perhaps the GSA was failing to access start URLs.  I tried
putting an S3 URL into the Administration > Network Diagnostics page.
I get this:

http://world.secondlife.com.s3.amazonaws.com/classified/16642eda-6954-21d4-31a2-9b9868221f8d.html
 	 returncode 505, should be 200
http://www.google.com/
 	 OK

The GSA documentation says that Network Diagnostics does a ping then an
HTTP HEAD request against all the URLs.  From within our network, curl
gets HTTP 200, success.

curl --request HEAD --user-agent gsa-crawler --verbose \
  http://world.secondlife.com.s3.amazonaws.com/classified/16642eda-6954-21d4-31a2-9b9868221f8d.html

Likewise, the GSA itself can crawl these URLs (at least in Continuous
Crawl mode).

HTTP 505 is "HTTP Version Not Supported".  The GSA appears to speak
HTTP 1.0 when I examine our server logs.  Obviously, Amazon S3 supports
HTTP 1.0.

I saw a note on Amazon's forums that their web server is quite picky
about the exact format of the request string -- in particular, extra
spaces before the "HTTP/1.0" part of the request make it choke.  But as
far as I can tell the GSA formats requests correctly.
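
As a rough illustration of the strictness described above, here is a
minimal Python sketch that checks whether the first line of a raw
request has the strict shape: method, one space, target, one space,
HTTP version, and nothing after it.  The function name and the regular
expression are illustrative assumptions, not anything taken from the
GSA or from S3.

```python
import re

# Strict request-line shape: METHOD SP target SP HTTP/x.y, nothing else.
# (Illustrative pattern; real servers may accept or reject other forms.)
STRICT_REQUEST_LINE = re.compile(r"[A-Z]+ \S+ HTTP/\d\.\d")


def request_line_is_strict(raw_request: bytes) -> bool:
    """Return True if the first line has single spaces and no trailing whitespace."""
    first_line = raw_request.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    # fullmatch rejects anything left over, such as a space after the version.
    return STRICT_REQUEST_LINE.fullmatch(first_line) is not None
```

Feeding it bytes captured from a packet trace of the appliance's
request would show whether the request line carries any stray
whitespace.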

Has anyone else seen Network Diagnostics reporting HTTP 505 errors
versus URLs on Amazon S3?  Is it a real problem?  Could it be causing
the GSA to conclude it can't hit my start URLs?

Thanks in advance,

James Cook
Software Engineer
jameslindenlab.com
"The Second Life People"


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the
Google Groups "Google Search Appliance" group.
To post to this group, send email to
Google-Search-Appliancegooglegroups.com
To unsubscribe from this group, send email to
Google-Search-Appliance-unsubscribegooglegroups.com
For more options, visit this group at
http://groups.google.com/group/Google-Search-Appliance?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: GSA Network Diagnostics shows HTTP 505 versus Amazon S3?
2007-09-27 01:20:03
Based on your findings regarding the pickiness of S3, I would not set
too much store by the status code returned by the URL tester. It may be
that S3 does not support HEAD requests, or it may be that it is
rejecting the HEAD request made by the URL tester because that request
has an additional space character after the HTTP/1.0 and before the
newline. This is a recognised bug in the URL tester, but it does not
affect the crawler, so it would not explain why the site will not crawl.
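
One way to check this outside the appliance is to send both request
forms by hand and compare the status codes. The Python sketch below is
a hypothetical helper, not the URL tester's actual code:
build_head_request, parse_status_code and probe are made-up names, the
Host header and port 80 are assumptions, and the gsa-crawler user agent
is borrowed from the curl command earlier in the thread. Whether a
given server answers 505 to the trailing-space form is for the probe to
find out.

```python
import socket


def build_head_request(host: str, path: str, trailing_space: bool = False) -> bytes:
    """Build a raw HTTP/1.0 HEAD request, optionally with a space after the version."""
    version = "HTTP/1.0 " if trailing_space else "HTTP/1.0"
    return (
        f"HEAD {path} {version}\r\n"
        f"Host: {host}\r\n"               # assumed: needed to select the bucket
        "User-Agent: gsa-crawler\r\n"
        "\r\n"
    ).encode("ascii")


def parse_status_code(response: bytes) -> int:
    """Extract the numeric status code from the first line of a raw response."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    return int(status_line.split(" ", 2)[1])


def probe(host: str, path: str, trailing_space: bool = False,
          port: int = 80, timeout: float = 10.0) -> int:
    """Send the request over a plain TCP socket and return the status code."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_head_request(host, path, trailing_space))
        data = b""
        # Read only until the status line is complete.
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_status_code(data)
```

Calling probe(host, path) and then probe(host, path,
trailing_space=True) against the same document gives the two codes to
compare.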

Is the Start URL shown in Crawl Diagnostics? If so, what Crawl Status
is given? If not, try using Continuous Crawl for a few hours and see if
you get anything in Crawl Diagnostics at all.

Thor.



Re: GSA Network Diagnostics shows HTTP 505 versus Amazon S3?
2007-09-27 17:52:10
Ah, there is an extra space after "HTTP/1.0" in the network diagnostics
test.  That would explain it.

S3 does support HEAD, at least when I do it from curl.

Crawl Diagnostics show the page being retrieved just fine.  So it must
be that bug in URL tester.

Thanks for the information!

James



