Thread: Following <form action> tags

Following <form action> tags
user name
2006-05-17 20:56:45
Gang,

I had a webmaster complain that our crawler was following
his <form action> links. Although he admits that his
use of the GET method is a bit unorthodox, he feels strongly
that form submissions with input fields shouldn't be
followed by crawlers. Would it make sense to modify the HTML
parser so that it checked to see whether such input fields
exist before following <form action> links?
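
To make the proposal concrete, here is a minimal sketch of the check in Python. It is not Nutch's actual parser (Nutch is written in Java); the names FormOutlinkParser, form_outlinks and PASSIVE_INPUT_TYPES are hypothetical, and the choice of which input types count as "passive" is an assumption made only for illustration. It treats a <form action> as an outlink only when the method is GET, rel="nofollow" is absent, and the form has no fields asking the user for data.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormOutlinkParser(HTMLParser):
    """Collect <form action> outlinks, skipping forms that expect user input."""

    # Assumption for this sketch: these <input> types do not ask for user data.
    PASSIVE_INPUT_TYPES = {"submit", "button", "hidden", "image", "reset"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self._form = None  # state of the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._form = {
                "action": attrs.get("action"),
                "method": (attrs.get("method") or "get").lower(),
                "rel": (attrs.get("rel") or "").lower().split(),
                "has_user_input": False,
            }
        elif self._form is not None and tag in ("input", "textarea", "select"):
            input_type = (attrs.get("type") or "text").lower()
            if tag != "input" or input_type not in self.PASSIVE_INPUT_TYPES:
                self._form["has_user_input"] = True

    def handle_endtag(self, tag):
        if tag != "form" or self._form is None:
            return
        form, self._form = self._form, None
        followable = (
            form["action"]
            and form["method"] == "get"
            and "nofollow" not in form["rel"]
            and not form["has_user_input"]
        )
        if followable:
            # Resolve a relative action against the URL of the page.
            self.outlinks.append(urljoin(self.base_url, form["action"]))


def form_outlinks(html, base_url):
    """Return the form actions in html that this sketch considers followable."""
    parser = FormOutlinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With this rule, a form that contains text fields (like the one the webmaster describes) yields no outlink, while a form holding only a submit button is still followed.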

- Chris

At 1:47 PM -0700 5/17/06, Chris Schneider wrote:
>Mark,
>
>At 8:15 AM +1200 5/18/06, Mark Rowe wrote:
>>On 18/05/2006, at 5:39 AM, Chris Schneider wrote:
>>>Thanks for providing the technical details about your pages.
>>>After reviewing the HTML parser used by Nutch, it appears that
>>>specifying either the rel="nofollow" attribute or the
>>>method="post" attribute would prevent our crawler (and other
>>>Nutch crawlers) from following these <form action...> links. If
>>>I understand your HTML correctly, it seems like you really are
>>>making this call to retrieve information, so method="get" (the
>>>default) does appear to be more appropriate.
>>
>>You understand incorrectly.  While the page in question abuses
>>the GET method to perform a mutating action, I still feel that it
>>is incorrect for it to be followed in this situation.  (The
>>mutating action in this case is for another computer to spend up
>>to an hour to download, build and compile a piece of software --
>>definitely not information retrieval.)
>
>I agree that the page in question should not be crawled. The
>remaining question is how to prevent that from happening.
>
>>>Thus, I humbly suggest that you add a rel="nofollow" to these
>>>links. This will not only prevent our crawler from following
>>>them, but solve the problem for Nutch and other crawler
>>>technologies that honor this attribute. Here's some technical
>>>information about it:
>>>
>>>http://microformats.org/wiki/rel-nofollow
>>
>>It is my understanding that 'rel="nofollow"' is only valid for
>><a> tags, and furthermore does not prevent the crawling of such
>>links.
>
>The rel attribute is valid for both <a> and <form action=...>
>tags. I see nothing in the specification restricting
>rel="nofollow" to <a> tags, so I would assume that it is valid
>for <form action=...> tags as well.
>
>>According to the link you mentioned, it "indicates that the
>>destination of that hyperlink SHOULD NOT be afforded any
>>additional weight or ranking by user agents which perform link
>>analysis upon web pages".  This doesn't stop a crawler from
>>following the link, only from inferring a relationship between
>>the source and destination.
>
>You are correct. However, it does in point of fact prevent the
>Nutch crawler's HTML parser from following <a> and
><form action...> tags that have this attribute. I would imagine
>that this would prevent other crawler technologies from following
>these links as well.
>
>>>If you have specific suggestions for other ways that Nutch
>>>might differentiate links like yours from other
>>><form action...> links that *are* of potential interest when
>>>crawling, then I could post this to the Nutch developer group
>>>mailing list.
>>
>>In my opinion it seems bizarre to submit a form with empty input
>>fields in the hope that you will get a valid page out the other
>>end.
>
>Perhaps, but many HTML pages still do use this technique,
>allowing a button or some other control to load a second page,
>etc.
>
>>Submitting a form is, in my mind, a much stronger action than
>>following a hyperlink.  This applies doubly to forms with
>>associated input fields.  I can think of very few examples of
>>forms off the top of my head where it would be desirable to crawl
>>the resultant page after submitting with all inputs empty.
>
>I will post a message to the Nutch developer mailing list
>describing your suggestion about not following these
><form action...> links if there are input fields in the form.
>However, it seems like a lot of work for the parser. Although I
>have absolutely no control over the behavior of other Nutch
>crawlers out there, I will consider making a change to our Nutch
>installation to avoid following such links.
>
>Best Regards,
>
>- Chris

At 9:51 AM +1200 5/17/06, mrowebdash.net.nz wrote:
>Hi Chris,
>
>An example of the type of form is visible at
>http://build.webkit.org/post-commit-powerpc-mac-os-x/builds/1921.
>The markup relevant to the form is:
>
><form action="1921/rebuild" class="command rebuild">
><div class="row">
>  <span class="label">Your name:</span>
>  <span class="field"><input type='text' name='username' /></span>
></div>
><div class="row">
>  <span class="label">Reason for re-running build:</span>
>  <span class="field"><input type='text' name='comments' /></span>
></div><input type="submit" value="Rebuild" />
></form>
>
>When the /rebuild link is activated it causes several machines
>within our build system to download + recompile our application.
>As you can probably appreciate, this is computationally intensive
>and is best avoided.
>
>Thanks,
>
>Mark
>
>> Mark,
>>
>> We're using the Nutch OpenSource crawler technology for our
>> crawls and have not modified the algorithm controlling which
>> areas of HTML pages are searched while harvesting outlinks. Our
>> URL filter should be preventing us from following links that
>> include queries (i.e., those containing a "?" character),
>> though. Could you provide some specific details about the <form>
>> tag and the embedded URLs within it that our crawler seems to be
>> following?
>>
>> Thanks,
>>
>> Chris Schneider
>>
>> At 10:07 AM +1200 5/13/06, Mark Rowe wrote:
>>>Hi,
>>>
>>>Your crawler is doing the most insanely stupid thing possible.
>>> It is following the URLs in <form> tags.  It is the *only* web
>>> crawler that I have seen do such a thing, and it is ridiculous.
>>> Some functionality is behind form tags for the reason that web
>>> crawlers follow hyperlinks in A tags, but do not submit forms.
>>> I will be preventing your crawler's IP range from accessing my
>>> server by firewall rules until you change this braindead
>>> behaviour.
>>>
>>>Regards,
>>>
>>>Mark Rowe
>>><http://bdash.net.nz/>

-- 
------------------------
Chris Schneider
TransPac Software, Inc.
SchmedTransPac.com
------------------------