
Thread: Created: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty




Created: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-01-26 08:09:49
Incorrect handling of relative paths when the embedded URL path is empty
------------------------------------------------------------------------

                 Key: NUTCH-436
                 URL: https://issues.apache.org/jira/browse/NUTCH-436
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
            Reporter: Andrew Groh
            Priority: Critical

If you have a base URL of the form:
http://a/b/c/d;p?q#f

Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x

See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
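The empty-path rules the report points to (RFC 1808, section 4, steps 2 and 5 through 7) can be sketched with plain string handling. This is only an illustration under stated assumptions: the class and method names are hypothetical, the base is assumed to be an absolute URL with no ';', '?' or '#' in its scheme or host part, and the embedded URL is assumed to have an empty path. It is not code from Nutch or from the JDK.

```java
// Hypothetical sketch of RFC 1808 resolution for embedded URLs with an empty path.
final class Rfc1808EmptyPath {

    /** Returns the part of s before the first occurrence of c, or all of s. */
    private static String cutAt(String s, char c) {
        int i = s.indexOf(c);
        return i < 0 ? s : s.substring(0, i);
    }

    /** Returns the part of s starting at the first occurrence of c, or "". */
    private static String from(String s, char c) {
        int i = s.indexOf(c);
        return i < 0 ? "" : s.substring(i);
    }

    static String resolve(String base, String embedded) {
        // Step 2: an entirely empty embedded URL inherits the whole base URL.
        if (embedded.isEmpty()) {
            return base;
        }
        // Split the base; its fragment is never inherited.
        String rest = cutAt(base, '#');
        String baseQuery = from(rest, '?');    // "?q" or ""
        rest = cutAt(rest, '?');
        String baseParams = from(rest, ';');   // ";p" or ""
        String basePrefix = cutAt(rest, ';');  // scheme, host and path

        // Split the embedded URL the same way.
        String fragment = from(embedded, '#');
        String e = cutAt(embedded, '#');
        String query = from(e, '?');
        e = cutAt(e, '?');
        String params = from(e, ';');
        if (!cutAt(e, ';').isEmpty()) {
            throw new IllegalArgumentException("embedded URL path is not empty");
        }
        // Step 5a: inherit the base params only if the embedded URL has none.
        if (params.isEmpty()) {
            params = baseParams;
            // Step 5b: inherit the base query only if the embedded URL has none.
            if (query.isEmpty()) {
                query = baseQuery;
            }
        }
        // Step 7: recombine the components.
        return basePrefix + params + query + fragment;
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q#f";
        System.out.println(resolve(base, "?y")); // http://a/b/c/d;p?y
        System.out.println(resolve(base, ";x")); // http://a/b/c/d;x
    }
}
```

Both results match the "Correct Absolute URL" column of the report.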
Updated: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-01-26 08:13:49
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Groh updated NUTCH-436:
------------------------------

    Description:
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL: ?y
Correct Absolute URL: http://a/b/c/d;p?y
Nutch Generated URL: http://a/b/c/?y
Embedded URL: ;x
Correct Absolute URL: http://a/b/c/d;x
Nutch Generated URL: http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt

  was:
If you have a base URL of the form:
http://a/b/c/d;p?q#f

Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x

See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt
Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-01-26 12:37:49
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467883 ]

Andrew Groh commented on NUTCH-436:
-----------------------------------

This is a bug in java.net.URL, specifically in the URLStreamHandler class that it uses.

new URL(new URL("http://a/b/c/d;p?q#f"), "?y")

creates a URL object with a bad URL.
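A minimal way to check this on a given JDK is sketched below. The class and method names are hypothetical. The printed result depends on the JDK in use, so no expected output is shown; the report in this thread gives http://a/b/c/?y and http://a/b/c/;x as the values Nutch produced.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical reproduction harness; not part of Nutch.
final class RelativeUrlRepro {

    /** Resolves embedded against base using only java.net.URL. */
    static String resolveWithJdk(String base, String embedded) {
        try {
            // URL(URL context, String spec) is the constructor that resolves
            // a relative spec against a base URL.
            return new URL(new URL(base), embedded).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q#f";
        for (String embedded : new String[] {"?y", ";x"}) {
            // Compare against RFC 1808 section 5.1:
            // ?y should give http://a/b/c/d;p?y and ;x should give http://a/b/c/d;x
            System.out.println(embedded + " -> " + resolveWithJdk(base, embedded));
        }
    }
}
```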
Assigned: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-03-04 00:08:50
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes reassigned NUTCH-436:
----------------------------------

    Assignee: Dennis Kubes



Updated: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-03-04 13:00:51
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-436:
-------------------------------

    Attachment: NUTCH-436-20070304.patch

NUTCH-436-20070304.patch handles correct encoding of the
params information in the base URL.  When creating a new
URL with a base URL and a target String path, the
java.net.URL class behaves correctly if the target contains
params information but the base does not.  If the base has
params information, the URL class strips it from the
resulting URL.  This patch is a workaround that moves the
base params information to the target so that the URL class
handles it correctly.
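The idea of moving base information into the target before calling java.net.URL can be sketched as follows. This is not the attached patch: the class and method names are hypothetical, and the sketch only covers targets that begin with '?' or ';'. It rewrites such a target into an absolute path built from the base, which java.net.URL then resolves without having to merge paths.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical workaround sketch; not the code in NUTCH-436-20070304.patch.
final class EmbeddedParamsWorkaround {

    static URL resolve(URL base, String target) throws MalformedURLException {
        // java.net.URL keeps params in the path, e.g. "/b/c/d;p".
        String path = base.getPath();
        if (target.startsWith("?")) {
            // Empty path and no params: keep the base path and its params.
            target = path + target;
        } else if (target.startsWith(";")) {
            // Empty path with params: keep the base path, drop the base params.
            int semi = path.indexOf(';');
            target = (semi < 0 ? path : path.substring(0, semi)) + target;
        }
        return new URL(base, target);
    }

    /** Convenience overload for callers that hold the base as a String. */
    static String resolve(String base, String target) {
        try {
            return resolve(new URL(base), target).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q#f";
        System.out.println(resolve(base, "?y")); // http://a/b/c/d;p?y
        System.out.println(resolve(base, ";x")); // http://a/b/c/d;x
    }
}
```

All other targets are passed to java.net.URL unchanged, so the sketch does not alter how ordinary relative paths are resolved.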



Resolved: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-03-09 20:38:09
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-436.
--------------------------------

    Resolution: Fixed

Patch tested on a 10,000-URL run with no apparent issues.
Reviewed and committed.



Closed: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-03-09 20:38:09
     [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-436.
------------------------------


Issue closed.



Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
2007-10-16 10:39:50
    [ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ]

Doug Cook commented on NUTCH-436:
---------------------------------

It looks like NUTCH-566 and its associated patch, which I
recently filed, duplicate this issue.

The patch I proposed may or may not handle the ';'
correctly; I need to check that.

But the patch for this issue (NUTCH-436) is limited to
DOMContentUtils, and the problem will exist wherever Sun's
URL class is used in URL extraction. It therefore affects
any parser, not just the HTML one: the same issue occurs in
JavaScript link extraction, Flash link extraction, etc. The
patch should therefore live in a centralized location (like
util).



