[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-505?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12512071 ]
ESPEN AMBLE KOLSTAD COMMENTED ON NUTCH-505:
-------------------------------------------
AUTOMATON (HTTP://WWW.BRICS.DK/AUTOMATON/), USED IN
AUTOMATONURLFILTER, IS EVEN FASTER IF YOU PREPARSE THE
REGEXES.
IT DOESN'T SUPPORT ALL REGEX SYNTAX, BUT IT SUPPORTS MOST OF IT.
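AS A ROUGH ILLUSTRATION OF THE PREPARSING IDEA, THE SKETCH BELOW COMPILES EACH RULE ONCE AND REUSES THE COMPILED FORM FOR EVERY URL. IT IS NOT THE NUTCH OR AUTOMATON CODE: THE CLASS NAME, THE RULE, AND THE SAMPLE URLS ARE HYPOTHETICAL, AND JAVA'S STANDARD REGEX PACKAGE STANDS IN FOR AUTOMATON SO THAT THE EXAMPLE HAS NO EXTERNAL DEPENDENCIES.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: each rule is parsed once up front
// and the compiled form is reused for every URL that is checked.
// java.util.regex stands in for Automaton here; with dk.brics.automaton the
// one-time step would presumably be building a RunAutomaton from
// new RegExp(rule).toAutomaton() instead of calling Pattern.compile.
public class PrecompiledUrlFilter {
    private final List<Pattern> accepted = new ArrayList<>();

    public PrecompiledUrlFilter(List<String> rules) {
        for (String rule : rules) {
            accepted.add(Pattern.compile(rule)); // parse once, not per URL
        }
    }

    // Returns the URL if any precompiled rule matches it entirely, else null.
    public String filter(String url) {
        for (Pattern p : accepted) {
            if (p.matcher(url).matches()) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed rule: http(s), a lower-case host, and a conservative path.
        PrecompiledUrlFilter f = new PrecompiledUrlFilter(
            List.of("https?://[a-z0-9.-]+(/[A-Za-z0-9._~/-]*)?"));
        System.out.println(f.filter("http://www.example.com/index.html"));
        System.out.println(f.filter("http://www.variety.com/</div></a>"));
    }
}
```

THE POINT OF THE DESIGN IS THAT THE COST OF PARSING A RULE IS PAID ONCE IN THE CONSTRUCTOR RATHER THAN ONCE PER URL.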
> OUTLINK URLS SHOULD BE VALIDATED
> --------------------------------
>
> KEY: NUTCH-505
> URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-505
> PROJECT: NUTCH
> ISSUE TYPE: IMPROVEMENT
> REPORTER: DOĞACAN GÜNEY
> ASSIGNEE: DOĞACAN GÜNEY
> PRIORITY: MINOR
> FIX FOR: 1.0.0
>
> ATTACHMENTS: NUTCH-505-V2.PATCH, NUTCH-505.PATCH, NUTCH-505.PATCH,
> NUTCH-505_DRAFT.PATCH, NUTCH-505_DRAFT_V2.PATCH
>
>
> SEE DISCUSSION HERE:
> HTTP://WWW.NABBLE.COM/FETCHING-HTTP%3A--WWW.VARIETY.COM-%3C-DIV%3E%3C-A%3E-TF3961692.HTML
> PARSE PLUGINS MAY EXTRACT GARBAGE URLS FROM PAGES. WE
> NEED A URL VALIDATION SYSTEM THAT TESTS THESE URLS AND
> FILTERS OUT GARBAGE.
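
THE KIND OF CHECK THE ISSUE ASKS FOR COULD LOOK ROUGHLY LIKE THE SKETCH BELOW. IT IS NOT THE ATTACHED PATCH: THE CLASS NAME, THE ACCEPTED SCHEMES, AND THE SAMPLE URLS ARE ASSUMPTIONS, AND THE CHECK RELIES ONLY ON THE STANDARD JAVA URI PARSER, WHICH REJECTS CHARACTERS SUCH AS ANGLE BRACKETS AND SPACES.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical validator, not the NUTCH-505 patch: accepts a URL only if it
// parses as a URI, uses http or https, and carries a host.
public class OutlinkValidator {

    public static boolean isValid(String url) {
        if (url == null) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(url); // throws on illegal characters such as '<' or ' '
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        boolean web = "http".equalsIgnoreCase(scheme)
                   || "https".equalsIgnoreCase(scheme);
        return web && uri.getHost() != null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/index.html",
            "http://www.variety.com/</div></a>",
            "mailto:someone@example.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```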
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE
ONLINE.