|
|
| Created: (NUTCH-436) Incorrect handling
of relative paths when the embedded URL
path is empty |

|
2007-01-26 08:09:49 |
|
| Incorrect handling of relative paths when the embedded URL path is empty
------------------------------------------------------------------------
Key: NUTCH-436
URL: https://issues.apache.org/jira/browse/NUTCH-436
Project: Nutch
Issue Type: Bug
Components: fetcher
Reporter: Andrew Groh
Priority: Critical
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for examples:
http://www.ietf.org/rfc/rfc1808.txt
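
The empty-path case (RFC 1808 section 4, step 5) can be sketched with plain string handling, without java.net.URL. This is only an illustration: the class and method names (EmptyPathResolver, resolve) are hypothetical and not part of Nutch, and the sketch handles only embedded URLs that begin with '?' or ';', assuming an absolute base URL with no ';' in its scheme or authority.

```java
// Hypothetical helper, not part of Nutch: resolves an embedded URL whose path
// is empty (it starts with '?' or ';') following RFC 1808 section 4, step 5.
final class EmptyPathResolver {

    /** Returns base truncated at the first occurrence of ch, or base unchanged. */
    private static String cutAt(String base, char ch) {
        int i = base.indexOf(ch);
        return i >= 0 ? base.substring(0, i) : base;
    }

    static String resolve(String base, String embedded) {
        if (embedded.isEmpty()
                || (embedded.charAt(0) != '?' && embedded.charAt(0) != ';')) {
            throw new IllegalArgumentException(
                "only '?...' and ';...' are handled: " + embedded);
        }
        // The fragment and query of the base are not inherited here, because
        // the embedded URL supplies its own params or query.
        String prefix = cutAt(cutAt(base, '#'), '?');
        if (embedded.charAt(0) == ';') {
            // Step 5a: the embedded params replace the base params.
            // Assumes ';' does not appear in the scheme or authority.
            prefix = cutAt(prefix, ';');
        }
        // Otherwise (step 5b) the base path and params are kept and only the
        // query is replaced.
        return prefix + embedded;
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q#f";
        System.out.println(resolve(base, "?y")); // http://a/b/c/d;p?y
        System.out.println(resolve(base, ";x")); // http://a/b/c/d;x
    }
}
```

Both results match the "Correct Absolute URL" values listed in the report.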
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
|
| Updated: (NUTCH-436) Incorrect handling
of relative paths when the embedded URL
path is empty |

|
2007-01-26 08:13:49 |
|
|
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Groh updated NUTCH-436:
------------------------------
Description:
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL: ?y
Correct Absolute URL: http://a/b/c/d;p?y
Nutch Generated URL: http://a/b/c/?y
Embedded URL: ;x
Correct Absolute URL: http://a/b/c/d;x
Nutch Generated URL: http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for examples:
http://www.ietf.org/rfc/rfc1808.txt
was:
If you have a base URL of the form:
http://a/b/c/d;p?q#f
Embedded URL    Correct Absolute URL    Nutch Generated URL
?y              http://a/b/c/d;p?y      http://a/b/c/?y
;x              http://a/b/c/d;x        http://a/b/c/;x
See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps, and section 5.1 for example
http://www.ietf.org/rfc/rfc1808.txt
|
| Commented: (NUTCH-436) Incorrect
handling of relative paths when the
embedded URL path is empty |

|
2007-01-26 12:37:49 |
|
|
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467883 ]
Andrew Groh commented on NUTCH-436:
-----------------------------------
This is a bug in java.net.URL, specifically in the URLStreamHandler class that it uses.
new URL(new URL("http://a/b/c/d;p?q#f"), "?y")
creates a URL object with a bad URL.
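
To check what a particular JDK does with the two embedded URLs from the report, a small comparison like the one below may help. It is a sketch only: the class name JdkUrlResolutionDemo and the method resolveWithJdk are hypothetical, the expected values are the RFC 1808 results quoted in the issue description, and newer JDKs may warn that the URL constructors are deprecated.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class: prints what the running JDK produces for the two
// embedded URLs from the report next to the RFC 1808 results.
final class JdkUrlResolutionDemo {

    /** Resolves embedded against base using the URL(URL, String) constructor. */
    static String resolveWithJdk(String base, String embedded)
            throws MalformedURLException {
        return new URL(new URL(base), embedded).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        String base = "http://a/b/c/d;p?q#f";
        // Each row: embedded URL, result expected by RFC 1808 (from the report).
        String[][] cases = {
            {"?y", "http://a/b/c/d;p?y"},
            {";x", "http://a/b/c/d;x"},
        };
        for (String[] c : cases) {
            String actual = resolveWithJdk(base, c[0]);
            System.out.println(c[0] + " -> " + actual
                + " (RFC 1808: " + c[1] + ", match=" + actual.equals(c[1]) + ")");
        }
    }
}
```

The output is deliberately not hard-coded here, since it depends on the JDK in use.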
|
| Assigned: (NUTCH-436) Incorrect
handling of relative paths when the
embedded URL path is empty |
2007-03-04 00:08:50 |
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes reassigned NUTCH-436:
----------------------------------
Assignee: Dennis Kubes
|
|
| Updated: (NUTCH-436) Incorrect handling
of relative paths when the embedded URL
path is empty |
2007-03-04 13:00:51 |
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-436:
-------------------------------
Attachment: NUTCH-436-20070304.patch
NUTCH-436-20070304.patch fixes the handling of the params
information in the base URL. When a new URL is created from
a base URL and a target path string, the java.net.URL class
behaves correctly if the target contains params information
and the base does not. If the base has params information,
however, the URL class strips it from the resulting URL.
This patch is a workaround that moves the base params
information to the target so that the URL class handles it
correctly.
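
The rewriting step described above might look roughly like the following. This is a sketch of the idea only, not the contents of NUTCH-436-20070304.patch: the class and method names are hypothetical, and it assumes the params are the ";..." portion of the base URL that precedes any query or fragment.

```java
// Hypothetical sketch, not the attached patch: copies the params of the base
// URL into a target that has none, so they are not lost during resolution.
final class BaseParamsMover {

    /** Returns the ";params" part of the base (without query/fragment), or "". */
    static String baseParams(String base) {
        int end = base.length();
        int hash = base.indexOf('#');
        if (hash >= 0) end = hash;
        int question = base.indexOf('?');
        if (question >= 0 && question < end) end = question;
        int semi = base.indexOf(';');
        return (semi >= 0 && semi < end) ? base.substring(semi, end) : "";
    }

    /** Inserts the base params into target, before any query or fragment. */
    static String moveParamsToTarget(String base, String target) {
        String params = baseParams(base);
        // Leave the target alone if it has its own params or the base has none.
        if (params.isEmpty() || target.indexOf(';') >= 0) {
            return target;
        }
        int cut = target.length();
        int hash = target.indexOf('#');
        if (hash >= 0) cut = hash;
        int question = target.indexOf('?');
        if (question >= 0 && question < cut) cut = question;
        return target.substring(0, cut) + params + target.substring(cut);
    }

    public static void main(String[] args) {
        // Prints the rewritten target that would then be passed to java.net.URL.
        System.out.println(moveParamsToTarget("http://a/b/c/d;p?q#f", "g?y"));
    }
}
```

The rewritten target string would then be handed to the URL class together with the original base.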
|
|
| Resolved: (NUTCH-436) Incorrect
handling of relative paths when the
embedded URL path is empty |
2007-03-09 20:38:09 |
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes resolved NUTCH-436.
--------------------------------
Resolution: Fixed
Patch tested on a 10,000-URL run with no apparent issues.
Reviewed and committed.
|
|
| Closed: (NUTCH-436) Incorrect handling
of relative paths when the embedded URL
path is empty |
2007-03-09 20:38:09 |
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes closed NUTCH-436.
------------------------------
Issue closed.
|
|
| Commented: (NUTCH-436) Incorrect
handling of relative paths when the
embedded URL path is empty |
2007-10-16 10:39:50 |
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ]
Doug Cook commented on NUTCH-436:
---------------------------------
It looks like NUTCH-566 and its associated patch, which I
recently filed, duplicate this issue.
The patch I proposed may or may not handle the ';'
correctly; I need to check that.
But the patch for this issue (NUTCH-436) is limited to
DOMContentUtils, and this problem will exist wherever Sun's
URL class is used in URL extraction, so it affects any
parser, not just the HTML one. The same issue occurs in
JavaScript link extraction, Flash link extraction, etc., so
the patch should be in a centralized location (like util).
|
|