Thread: Created: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs




Created: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
2007-10-10 10:56:50
Sun's URL class has bug in creation of relative query URLs
----------------------------------------------------------

                 Key: NUTCH-566
                 URL: https://issues.apache.org/jira/browse/NUTCH-566
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0, 0.8.1, 0.8
         Environment: Mac OS X and Linux (CentOS 4.5)
            Reporter: Doug Cook
            Priority: Minor


I'm using 0.8.1, but this will affect all other versions as
well.

Relative links of the form "?blah" are resolved
incorrectly. For example, with a base URL of
http://www.fleurie.org/entreprise.asp and a relative link of
"?id_entrep=111", Nutch will resolve this pair to the link
"http://www.fleurie.org/?id_entrep=111". No such URL exists,
and all browsers I tried will resolve the pair to
"http://www.fleurie.org/entreprise.asp?id_entrep=111".

I tracked this down to what could be called a bug in Sun's
URL class. According to Sun's spec, they parse the relative
URL according to RFC 2396. But the original RFC for relative
links was RFC 1808, and the two RFCs differ in how they
handle relative links beginning with "?". Most
browsers (Netscape/Mozilla, IE, Safari) implemented RFC
1808, and stuck with it (for compatibility and also because
the behavior makes more sense). Apparently even the people
that wrote RFC 2396 recognized that this was a mistake, and
the specified behavior was changed in RFC 3986 to match what
browsers do. 

For a discussion of this, see
http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query

Sun's URL implementation, however, still implements RFC 2396,
as far as I can tell, and is out of step with the rest of
the world. This breaks link extraction on a number of sites.

I implemented a simple workaround, which I'm attaching. It
is a static method to create URLs which takes the same
arguments as new URL(URL base, String relativePath) but
resolves "?" links the way browsers do, and I use it as a
drop-in replacement for that constructor in DOMContentUtils,
JavaScript link extraction, etc. It only matters where links
are extracted. I haven't included the calling code from
DOMContentUtils, etc. because my local versions are largely
rewritten, but the change should be straightforward.

I put it in the org.apache.nutch.net package, but
feel free to move it to another place if you feel
it belongs elsewhere!
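
For readers without the attachment, here is a minimal sketch of how
such a static method might look. It is not the attached
RelativeURL.java; the class name RelativeUrlSketch and the method
name resolve are hypothetical, chosen only for illustration. The
idea is to prepend the base path to a query-only link before handing
it to the URL constructor, so the relative-query code path in URL is
never exercised.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for illustration; not the attached RelativeURL.java. */
class RelativeUrlSketch {

    /**
     * Resolves target against base like new URL(base, target), except that
     * a query-only target ("?...") keeps the full base path, which is what
     * browsers and RFC 3986 do.
     */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            // getPath() excludes any query string already present on the base.
            String path = base.getPath();
            if (path.isEmpty()) {
                path = "/"; // base such as http://host with no path at all
            }
            // An absolute path plus query avoids the query-only special case.
            return new URL(base, path + target);
        }
        // Everything else is left to the standard constructor.
        return new URL(base, target);
    }
}
```

With the base URL from the example above, resolve(base,
"?id_entrep=111") should return
http://www.fleurie.org/entreprise.asp?id_entrep=111, and links that
do not start with "?" go through the standard constructor unchanged.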




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Updated: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
2007-10-10 10:58:50
     [ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cook updated NUTCH-566:
----------------------------

    Attachment: RelativeURL.java

Here's a static method to work around the problem.



