Thread: Created: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
|
|
| Created: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-03 02:47:18 |
REDIRECTION HANDLING: YAHOOSLURP'S ALGORITHM
--------------------------------------------
KEY: NUTCH-547
URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547
PROJECT: NUTCH
ISSUE TYPE: IMPROVEMENT
COMPONENTS: FETCHER
REPORTER: DOĞACAN GÜNEY
FIX FOR: 1.0.0
AFTER READING YAHOO'S ALGORITHM (THE ONE ANDRZEJ LINKED TO:
HTTP://HELP.YAHOO.COM/L/NZ/YAHOOXTRA/SEARCH/WEBCRAWLER/SLURP-11.HTML )
IN THE REDIRECT/ALIAS HANDLING DISCUSSION, I HAD A BIT OF SPARE
TIME, SO I IMPLEMENTED IT.
NOTE THAT THE PATCH I AM ATTACHING IS FOR THE 'CHOOSING'
ALGORITHM DESCRIBED IN YAHOO'S HELP PAGE. IT MAKES NO ATTEMPT TO
HANDLE ALIASES IN ANY WAY. (SEE
HTTP://WWW.NABBLE.COM/REDIRECTS-AND-ALIAS-HANDLING-%28LONG%29-TF4270371.HTML#A12154362
FOR THE DISCUSSION ABOUT ALIAS HANDLING).
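A minimal sketch of the 'choosing' step for the redirect in this report, not the code from the attached patch. It assumes a simplified rule (keep the source URL when it is a site root and the target is a deeper page, otherwise use the target); Yahoo's help page may cover more cases. The names `ReprUrlSketch`, `isRoot` and `chooseRepr` are hypothetical.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not taken from the attached patch. */
final class ReprUrlSketch {

    /** True when the URL has an empty or "/" path and no query string. */
    static boolean isRoot(String url) {
        URI u = URI.create(url);
        String path = u.getPath();
        return (path == null || path.isEmpty() || path.equals("/"))
                && u.getQuery() == null;
    }

    /**
     * Simplified rule (an assumption): keep the source when it is a site
     * root and the target is a deeper page; otherwise use the target.
     */
    static String chooseRepr(String source, String target) {
        if (isRoot(source) && !isRoot(target)) {
            return source;
        }
        return target;
    }

    public static void main(String[] args) {
        String source = "http://www.milliyet.com.tr/";
        String target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39";
        // The root URL is kept as the representative form.
        System.out.println(chooseRepr(source, target));
    }
}
```

Under this rule the root URL is chosen for the redirect in this report, which matches the representative form named in the walk-through.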
E.G.,
GENERATE "HTTP://WWW.MILLIYET.COM.TR/"
FETCH "HTTP://WWW.MILLIYET.COM.TR/", WHICH REDIRECTS TO
"HTTP://WWW.MILLIYET.COM.TR/2007/08/29/INDEX.HTML?VER=39".
UPDATE THE SECOND PAGE'S DATUM'S METADATA TO INDICATE THAT
"HTTP://WWW.MILLIYET.COM.TR/" IS THE REPRESENTATIVE FORM.
UPDATEDB, INVERTLINKS, ETC...
WHILE INDEXING THE SECOND PAGE, CHANGE ITS "URL" FIELD TO
"HTTP://WWW.MILLIYET.COM.TR/".
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE
ONLINE.
|
|
| Updated: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-03 02:49:19 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:ALL-TABPANEL ]
DOĞACAN GÜNEY UPDATED NUTCH-547:
--------------------------------
ATTACHMENT: REDIRECT_DRAFT.PATCH
DRAFT VERSION.
|
|
| Commented: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-03 13:14:57 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12524567 ]
ANDRZEJ BIALECKI COMMENTED ON NUTCH-547:
-----------------------------------------
A FEW COMMENTS:
* THE PATCH USES A STRANGE DIFF FORMAT ... THE FIRST LINES
OF CONTEXT DIFFS APPEAR ON THE SAME LINES AS CHUNK
COORDINATES.
* IN FETCHER[2].HANDLEREDIRECT(), WHAT HAPPENS WHEN THE
SELECTED REPRURL IS THE SAME AS THE URLSTRING? WE SHOULD
SKIP THE REDIRECT THEN.
* THE REPEATING PARSING OF REFRESHTIME SHOULD BE HIDDEN IN A
UTILITY METHOD IN PARSESTATUS - ALTHOUGH THE PROPER WAY TO
SUPPORT THIS WOULD BE TO EXTEND PARSESTATUS TO STORE THIS
INT VALUE IF NECESSARY, I.E. IF PARSESTATUS IS
SUCCESS_REDIRECT (WE WOULD HAVE TO BUMP THE VERSION NUMBER,
TOO).
* MINIMUM REFRESHTIME SHOULD BE AT LEAST A CONSTANT, OR
CONFIGURABLE, AND NOT A LITERAL. SIMILARLY THE REDIRTYPE
SHOULD BE A CONSTANT.
* PARSING OF THE REDIRECT TIME SHOULD BE MOVED IMHO TO
HANDLEREDIRECT(), TO SIMPLIFY THE LOGIC IN THE
FETCHERTHREAD.RUN().
* IF WE CHANGE THE "URL" FIELD IN BASICINDEXINGFILTER, SHOULDN'T WE
ALSO CHANGE THE "SITE" AND "HOST" FIELDS? WE COULD ALSO CONSIDER
ADDING REPRURL AS AN ADDITIONAL VALUE FOR THE SAME "URL" FIELD - THIS
WAY WE WOULD GET HITS BOTH ON THE ORIGINAL URL AND THE REPRURL.
* I'M NOT SURE WHY THE PATCH TO INDEXER.JAVA TRIES TO
OVERWRITE REPRURL FROM FETCHDATUM WITH THE VALUE FROM
DBDATUM - IF ANYTHING, THE VALUE IN FETCHDATUM SHOULD BE
MORE UP TO DATE, NO? AS IT IS NOW, IT'S SILENTLY
OVERWRITTEN. THE ONLY WAY THE REPRURL COULD END UP IN
DBDATUM IS FROM A PREVIOUS UPDATEDB OPERATION, SO IT SHOULD
CONTAIN OLDER INFORMATION.
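For the refresh-time points in the list above, a minimal sketch of what a shared parsing helper with a named constant could look like. It is not code from the patch or from ParseStatus: the class name, method names and the placeholder value of the constant are all assumptions made for illustration.

```java
/** Hypothetical utility for illustration; not part of the patch or of ParseStatus. */
final class RefreshTimeSketch {

    /** Placeholder threshold in seconds; an assumed value, not the patch's. */
    static final int DEFAULT_MIN_REFRESH_TIME = 5;

    /** Parses a refresh time given as text, returning the fallback on bad input. */
    static int parseRefreshTime(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** True when the refresh time is below the given (possibly configured) minimum. */
    static boolean isBelowMinimum(int refreshTime, int minRefreshTime) {
        return refreshTime < minRefreshTime;
    }
}
```

Callers would then pass either `DEFAULT_MIN_REFRESH_TIME` or a value read from configuration, so the literal appears in one place only.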
|
|
| Commented: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-04 06:42:48 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12524693 ]
DOĞACAN GÜNEY COMMENTED ON NUTCH-547:
-------------------------------------
THANKS A LOT FOR THE QUICK REVIEW, ANDRZEJ.
> * THE PATCH USES A STRANGE DIFF FORMAT ... THE FIRST LINES OF CONTEXT
> DIFFS APPEAR ON THE SAME LINES AS CHUNK COORDINATES.
SORRY ABOUT THAT. I AM USING GIT-SVN (WHICH, BY THE WAY, IS
AN AWESOME TOOL) TO DEVELOP NUTCH SO I MAY HAVE FORGOTTEN TO
USE "SVN DIFF" FOR THE PATCH.
> * IN FETCHER[2].HANDLEREDIRECT(), WHAT HAPPENS WHEN THE SELECTED
> REPRURL IS THE SAME AS THE URLSTRING? WE SHOULD SKIP THE REDIRECT THEN.
WE DON'T FOLLOW REPRURL, WE FOLLOW NEWURL WHICH IS TESTED
FOR EQUALITY WITH URLSTRING. HOWEVER, WE SHOULD PROBABLY
AVOID WRITING REPRURL IN CRAWLDATUM METADATA IF IT IS THE
SAME AS THE URLSTRING.
> * THE REPEATING PARSING OF REFRESHTIME SHOULD BE HIDDEN IN A UTILITY
> METHOD IN PARSESTATUS - ALTHOUGH THE PROPER WAY TO SUPPORT THIS WOULD
> BE TO EXTEND PARSESTATUS TO STORE THIS INT VALUE IF NECESSARY, I.E. IF
> PARSESTATUS IS SUCCESS_REDIRECT (WE WOULD HAVE TO BUMP THE VERSION
> NUMBER, TOO).
GOOD POINT. WILL LOOK INTO THAT.
> * MINIMUM REFRESHTIME SHOULD BE AT LEAST A CONSTANT, OR CONFIGURABLE,
> AND NOT A LITERAL. SIMILARLY THE REDIRTYPE SHOULD BE A CONSTANT.
THIS PATCH IS ONLY A ROUGH DRAFT. I WILL FIX ALL SUCH ISSUES
IN A LATER PATCH.
> * IF WE CHANGE THE "URL" FIELD IN BASICINDEXINGFILTER, SHOULDN'T WE
> ALSO CHANGE THE "SITE" AND "HOST" FIELDS? [...]
WOW, CAN'T BELIEVE I MISSED THAT.
> [..] WE COULD ALSO CONSIDER ADDING REPRURL AS AN ADDITIONAL VALUE FOR
> THE SAME "URL" FIELD - THIS WAY WE WOULD GET HITS BOTH ON THE ORIGINAL
> URL AND THE REPRURL.
THIS MAY CAUSE PROBLEMS WITH DEDUP, WHICH ASSUMES THAT THE "URL" FIELD
HAS A SINGLE VALUE. ALSO, IT MAY BE DIFFICULT TO DECIDE WHICH VALUE OF
"URL" TO SHOW IN THE WEB UI. I ALSO LIKE THE FACT THAT "URL" IS LIKE A
UNIQUE KEY FOR THE DOCUMENT. IF WE ALLOW "URL" TO HAVE MULTIPLE VALUES,
WE LOSE THAT. PERHAPS WE CAN ADD REPRURL TO A "REPR" FIELD INSTEAD?
|
|
| Commented: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-10 15:25:29 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12526258 ]
ANDRZEJ BIALECKI COMMENTED ON NUTCH-547:
-----------------------------------------
> > I'M NOT SURE WHY THE PATCH TO INDEXER.JAVA TRIES TO OVERWRITE
> > REPRURL FROM FETCHDATUM WITH THE VALUE FROM DBDATUM [..]
I'M STILL NOT SURE ABOUT THIS ISSUE - COULD YOU PLEASE
CLARIFY?
> PERHAPS WE CAN ADD REPRURL TO A "REPR" FIELD INSTEAD?
SHOULDN'T THIS BE THE OTHER WAY AROUND - THE IDEA OF YOUR
PATCH IS TO PUT THE DATA UNDER THE REPRURL, SO IN ORDER TO
MINIMIZE CODE CHANGES YOU REPLACE THE ORIGINAL URL WITH
REPRURL. THIS WAY WE LOSE THE VALUE OF THE ORIGINAL URL, SO
IT SEEMS TO ME THAT IF WE WANT TO PRESERVE IT WE SHOULD ADD
IT TO AN "ORIG" FIELD ..
|
|
| Commented: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-10 15:44:29 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12526263 ]
DOĞACAN GÜNEY COMMENTED ON NUTCH-547:
-------------------------------------
>> > I'M NOT SURE WHY THE PATCH TO INDEXER.JAVA TRIES TO OVERWRITE
>> > REPRURL FROM FETCHDATUM WITH THE VALUE FROM DBDATUM [..]
>
> I'M STILL NOT SURE ABOUT THIS ISSUE - COULD YOU PLEASE CLARIFY?
SORRY, IT SEEMS I FORGOT TO ANSWER IT.
IT IS POSSIBLE THAT WE DISCOVER A META-REDIRECT DURING THE PARSE
PHASE. WE HAVE NO WAY OF UPDATING FETCH DATUMS AT THIS
POINT, SO INSTEAD, PARSE WRITES THIS INFORMATION TO
CRAWL_PARSE WHICH OF COURSE IS THEN PASSED TO CRAWLDB. SO,
DURING INDEXING, IT IS POSSIBLE THAT DBDATUM CONTAINS
(META-)REDIRECT INFORMATION WHILE FETCHDATUM DOES NOT. BUT
YOU ARE RIGHT THAT WE SHOULD PROBABLY GIVE SOME SORT OF
PRIORITY TO FETCHDATUM'S METADATA OVER DBDATUM.
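The priority rule suggested here could be sketched as follows. This is a simplified illustration and not the Indexer.java change itself: plain string maps stand in for the metadata of the two datums, and the key name `reprUrl` and the class and method names are assumptions.

```java
import java.util.Map;

/** Hypothetical illustration of preferring fetch-time metadata over CrawlDb metadata. */
final class ReprPrioritySketch {

    /** Assumed metadata key; the key used by the patch is not stated in this thread. */
    static final String REPR_KEY = "reprUrl";

    /**
     * Returns the representative URL from the fetch metadata when present,
     * otherwise the one from the db metadata (for example, one recorded after
     * a meta-redirect found during parsing), otherwise null.
     */
    static String pickReprUrl(Map<String, String> fetchMeta, Map<String, String> dbMeta) {
        String fromFetch = (fetchMeta == null) ? null : fetchMeta.get(REPR_KEY);
        if (fromFetch != null) {
            return fromFetch;
        }
        return (dbMeta == null) ? null : dbMeta.get(REPR_KEY);
    }
}
```

With this ordering the db value is only a fallback, so it can no longer silently overwrite a value set at fetch time.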
> PERHAPS WE CAN ADD REPRURL TO A "REPR" FIELD INSTEAD?
>
> SHOULDN'T THIS BE THE OTHER WAY AROUND - THE IDEA OF YOUR PATCH IS TO
> PUT THE DATA UNDER THE REPRURL, SO IN ORDER TO MINIMIZE CODE CHANGES
> YOU REPLACE THE ORIGINAL URL WITH REPRURL. THIS WAY WE LOSE THE VALUE
> OF THE ORIGINAL URL, SO IT SEEMS TO ME THAT IF WE WANT TO PRESERVE IT
> WE SHOULD ADD IT TO AN "ORIG" FIELD ..
OK, THIS MAKES SENSE TO ME. I GUESS WE SHOULD MAKE
"ORIG" BOTH INDEXED AND STORED?
---
BTW, ONE OF THE MAJOR ISSUES WITH REDIRECTION (THAT THIS
PATCH DOES NOT SOLVE) IS THAT SCORES AND OTHER INFORMATION ARE
NOT REFLECTED IN REDIRECTIONS. ASSUME FOO.COM IS A MAJOR WEB
SITE. THE URL HTTP://WWW.FOO.COM/ REDIRECTS TO
HTTP://WWW.FOO.COM/DAILY.HTML . PEOPLE, NATURALLY, ARE MUCH
MORE LIKELY TO LINK TO HTTP://WWW.FOO.COM/ THAN TO
HTTP://WWW.FOO.COM/DAILY.HTML (THE PROBLEM IS EVEN MORE
INTERESTING IF HTTP://FOO.COM ALSO POINTS TO
HTTP://WWW.FOO.COM/DAILY.HTML ). SO, I THINK WE MUST HAVE
SOME WAY TO "PASS" THE SCORE FROM THE SOURCE PAGE TO THE
REDIRECT TARGET. THE SAME GOES FOR ADAPTIVE CRAWLS: IT MAY
LOOK LIKE WWW.FOO.COM NEVER CHANGES (SINCE IT JUST REDIRECTS
TO A DIFFERENT URL), BUT IT SHOULD BE CONSIDERED
"MODIFIED" WHENEVER THE PAGE AT THE REDIRECT URL IS
UPDATED.
I AM NOT SURE HOW WE CAN ACHIEVE THIS, THOUGH. WE WILL
PROBABLY NEED AN EXTRA JOB (THAT SHOULD RUN AT LEAST ONCE
BEFORE INDEXING) THAT MERGES INFORMATION FROM SUCH PAGES.
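As a rough picture of what such a merging step might compute, here is a minimal in-memory sketch. It is not a Nutch job and makes several assumptions: scores are plain floats keyed by URL, redirects are a source-to-target map, and the source's score is simply added to the target's. The right way to combine scores is an open question in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of passing scores from redirect sources to their targets. */
final class RedirectScoreMergeSketch {

    /**
     * Returns a copy of the scores in which every redirect target has also
     * received the score of each source URL that redirects to it.
     * Summing is an assumption made for this sketch.
     */
    static Map<String, Float> mergeScores(Map<String, Float> scores,
                                          Map<String, String> redirects) {
        Map<String, Float> merged = new HashMap<>(scores);
        for (Map.Entry<String, String> redirect : redirects.entrySet()) {
            Float sourceScore = scores.get(redirect.getKey());
            if (sourceScore == null) {
                continue; // no score known for this source URL
            }
            merged.merge(redirect.getValue(), sourceScore, Float::sum);
        }
        return merged;
    }
}
```

In the foo.com case above, both http://www.foo.com/ and http://foo.com would contribute their scores to http://www.foo.com/daily.html.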
|
|
| Updated: (NUTCH-547) Redirection
handling: YahooSlurp's algorithm |
  United States |
2007-09-20 09:06:31 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-547?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:ALL-TABPANEL ]
DOĞACAN GÜNEY UPDATED NUTCH-547:
--------------------------------
ATTACHMENT: REDIRECT_DRAFT_V2.PATCH
DRAFT 2
* REFRESH TIME IS NOW STORED AS AN EXTRA ARGUMENT.
* DURING INDEXING, IF THE "URL" FIELD IS CHANGED, THE "HOST" AND
"SITE" FIELDS ARE UPDATED ACCORDINGLY.
* MADE REFRESHTIME MAGIC NUMBER A CONSTANT. MADE REDIRTYPE
A CONSTANT.
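The field update in the second item could look roughly like the sketch below. It is not the BasicIndexingFilter change from the patch: a string map stands in for the index document, the lowercase field names and the idea that both "host" and "site" hold the host name are assumptions, and the "orig" field follows the suggestion made earlier in this thread rather than anything stated about draft 2.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of keeping index fields consistent with a changed URL. */
final class IndexFieldsSketch {

    /**
     * Returns a copy of the document whose "url" is replaced by reprUrl, with
     * "host" and "site" recomputed from it and the previous URL kept in "orig".
     * The document is returned unchanged if reprUrl is null or already the URL.
     */
    static Map<String, String> applyRepr(Map<String, String> doc, String reprUrl) {
        Map<String, String> out = new HashMap<>(doc);
        String current = doc.get("url");
        if (reprUrl == null || reprUrl.equals(current)) {
            return out;
        }
        String host = URI.create(reprUrl).getHost();
        out.put("orig", current);  // preserve the original URL
        out.put("url", reprUrl);
        out.put("host", host);
        out.put("site", host);
        return out;
    }
}
```

Skipping the update when the two URLs are equal also avoids storing a representative URL that is the same as the page's own URL.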
|
|
|
|