
Thread: Created: (NUTCH-504) NUTCH-443 broke parsing during fetching




Created: (NUTCH-504) NUTCH-443 broke parsing during fetching
2007-06-22 03:30:25
NUTCH-443 broke parsing during fetching
---------------------------------------

                 Key: NUTCH-504
                 URL: https://issues.apache.org/jira/browse/NUTCH-504
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
             Fix For: 1.0.0


After NUTCH-443, if parsing is done during fetching and
parsing fails for a URL, that URL does not get the segment
name or similar properties in its metadata. Because of this,
the indexer fails, since it expects to see a segment name
for all parses, even those that failed.
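
The failure mode described above can be illustrated with a
minimal, self-contained sketch. It does not use the Nutch API:
the ParseResult class, the parse() helper and the
SEGMENT_NAME_KEY value are hypothetical stand-ins. The point is
only that the segment name is attached outside any success
check, so a failed parse still carries it.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseMetadataSketch {
    // Hypothetical metadata key; the key Nutch actually uses may differ.
    static final String SEGMENT_NAME_KEY = "segment.name";

    /** Stand-in for a parse result: a success flag, text, and metadata. */
    static final class ParseResult {
        final boolean success;
        final String text;
        final Map<String, String> metadata = new HashMap<>();

        ParseResult(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** Hypothetical parser: treats null or empty content as a failed parse. */
    static ParseResult parse(String content) {
        if (content == null || content.isEmpty()) {
            // A failed parse is replaced by an empty parse.
            return new ParseResult(false, "");
        }
        return new ParseResult(true, content.trim());
    }

    /** Parses during the fetch and always records the segment name. */
    static ParseResult parseDuringFetch(String content, String segmentName) {
        ParseResult result = parse(content);
        // Set the segment name whether or not the parse succeeded,
        // so later steps that expect it do not break on failed parses.
        result.metadata.put(SEGMENT_NAME_KEY, segmentName);
        return result;
    }

    public static void main(String[] args) {
        ParseResult failed = parseDuringFetch("", "example-segment");
        System.out.println("success=" + failed.success
                + " segment=" + failed.metadata.get(SEGMENT_NAME_KEY));
    }
}
```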

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching
2007-06-22 03:32:25
     [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-504:
--------------------------------

    Attachment: parse_in_fetchers.patch

Patch for the problem. I think it would be nice to add a
test case for this, but I am not sure how we can force a
parse to fail so we can test it properly (comments are
welcome).




Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching
2007-06-22 03:34:25
    [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507162 ]

Doğacan Güney commented on NUTCH-504:
-------------------------------------

Also, should we actually index documents even if their
parses have failed? Since we replace a URL's parse with an
empty parse anyway when it fails, it may be a good idea to
skip such documents.
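
As a rough sketch of that idea (not the code in the attached
patch), an indexing step could drop documents whose parse
failed before doing anything else. The ParsedDoc class and the
selectIndexable() method below are hypothetical names used only
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipFailedParsesSketch {
    /** Stand-in for a fetched document together with its parse outcome. */
    static final class ParsedDoc {
        final String url;
        final boolean parseSucceeded;
        final String text;

        ParsedDoc(String url, boolean parseSucceeded, String text) {
            this.url = url;
            this.parseSucceeded = parseSucceeded;
            this.text = text;
        }
    }

    /** Keeps only documents whose parse succeeded; failed parses have no usable text. */
    static List<ParsedDoc> selectIndexable(List<ParsedDoc> docs) {
        List<ParsedDoc> indexable = new ArrayList<>();
        for (ParsedDoc doc : docs) {
            if (doc.parseSucceeded) {
                indexable.add(doc);
            }
        }
        return indexable;
    }

    public static void main(String[] args) {
        List<ParsedDoc> docs = Arrays.asList(
                new ParsedDoc("http://example.com/ok", true, "some text"),
                new ParsedDoc("http://example.com/bad", false, ""));
        for (ParsedDoc doc : selectIndexable(docs)) {
            System.out.println("would index: " + doc.url);
        }
    }
}
```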



Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching
2007-06-22 03:49:26
    [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507168 ]

Andrzej Bialecki commented on NUTCH-504:
----------------------------------------

+1. We should skip documents that failed to parse properly;
in such cases we have no usable text anyway.



Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching
2007-06-22 07:24:26
     [ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-504:
--------------------------------

    Attachment: NUTCH-504_v2.patch

New version.

* Includes the older patch.
* The indexer now filters out unsuccessful parses.
* Updated the TestFetcher unit test; TestFetcher now fails
without this patch.
* Also added an http.robots.agents property to
src/test/crawl-tests.xml. Without this, TestFetcher logs a
fatal RobotRuleParser error (which doesn't cause TestFetcher
to fail but is still annoying).


