[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-488?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12529382 ]
DOĞACAN GÜNEY COMMENTED ON NUTCH-488:
-------------------------------------
LOOKS GOOD BUT A COUPLE OF COMMENTS:
* YOU SHOULD USE CONF.GETSTRINGS INSTEAD OF SPLITTING
THE VALUE YOURSELF
* THIS LINE:
IF ( ! IGNORETAGS.CONTAINS("FORM") &&
CONF.GETBOOLEAN("PARSER.HTML.FORM.USE_ACTION",
FALSE))
SHOULD PERHAPS CONTAIN AN OR TO KEEP BACKWARD
COMPATIBILITY
* DOING THE "IF ( !IGNORETAGS )" FOR EVERYTHING
IS ERROR PRONE. PERHAPS WE CAN ADD A MAP<TAG-NAME,
TAG-ATTR> TYPE OF THING SO:
FOR (ENTRY<STRING, STRING> ENTRY:
IGNOREMAP.ENTRYSET()) {
STRING TAG = ENTRY.GETKEY();
STRING ATTR = ENTRY.GETVALUE();
LINKPARAMS.PUT(TAG, NEW LINKPARAMS(TAG, ATTR, 0));
}
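
One possible reading of the map-based suggestion above, as a minimal self-contained sketch. The `LinkParams` class here is a hypothetical stand-in for the parser's own holder, the tag/attribute pairs in `defaultTagAttrs()` are assumptions for illustration, and the child-length argument is left at 0 as in the comment. In Nutch itself the ignored tag names would presumably come from `conf.getStrings(...)` on whatever property the patch defines; here they are passed in as a plain `Set`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LinkParamsSketch {

    /** Hypothetical stand-in for the parser's LinkParams holder. */
    static final class LinkParams {
        final String elName;
        final String attrName;
        final int childLen;

        LinkParams(String elName, String attrName, int childLen) {
            this.elName = elName;
            this.attrName = attrName;
            this.childLen = childLen;
        }
    }

    /** Candidate tag -> link attribute pairs (an assumed set, for illustration). */
    static Map<String, String> defaultTagAttrs() {
        Map<String, String> tagAttrs = new LinkedHashMap<>();
        tagAttrs.put("a", "href");
        tagAttrs.put("area", "href");
        tagAttrs.put("form", "action");
        tagAttrs.put("frame", "src");
        tagAttrs.put("iframe", "src");
        tagAttrs.put("script", "src");
        tagAttrs.put("link", "href");
        tagAttrs.put("img", "src");
        return tagAttrs;
    }

    /** Registers every tag that is not ignored, in one loop instead of one "if" per tag. */
    static Map<String, LinkParams> buildLinkParams(Map<String, String> tagAttrs,
                                                   Set<String> ignoreTags) {
        Map<String, LinkParams> linkParams = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : tagAttrs.entrySet()) {
            String tag = entry.getKey();
            String attr = entry.getValue();
            if (ignoreTags.contains(tag)) {
                continue; // tag disabled by configuration
            }
            linkParams.put(tag, new LinkParams(tag, attr, 0));
        }
        return linkParams;
    }

    public static void main(String[] args) {
        Set<String> ignoreTags = new HashSet<>(Arrays.asList("script", "img"));
        // Prints the tags that would still be scanned for outlinks.
        System.out.println(buildLinkParams(defaultTagAttrs(), ignoreTags).keySet());
    }
}
```

Keeping the tag/attribute pairs in a single map means a new tag only needs one new entry, rather than another hand-written `if` that could be forgotten or mistyped.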
> AVOID PARSING UNNECESSARY LINKS AND GET A MORE RELEVANT
OUTLINK LIST
> --------------------------------------------------------------------
>
> KEY: NUTCH-488
> URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-488
> PROJECT: NUTCH
> ISSUE TYPE: IMPROVEMENT
> AFFECTS VERSIONS: 0.9.0
> ENVIRONMENT: WINDOWS, JAVA 1.5
> REPORTER: EMMANUEL JOKE
> ATTACHMENTS: DOMCONTENTUTILS.PATCH,
IGNORE_TAGS_V2.PATCH, NUTCH-DEFAULT.XML.PATCH
>
>
> THE NEKOHTML PARSER USES A METHOD TO EXTRACT ALL OUTLINKS
FROM THE HTML PAGE. IT EXTRACTS THEM FROM THE HTML
CONTENT BASED ON THE LIST OF PARAMS DEFINED IN THE METHOD
SETCONF(). THIS LIST OF LINKS IS THEN TRUNCATED TO THE
MAXIMUM NUMBER OF OUTLINKS THAT WE'LL PROCESS FOR A PAGE,
DEFINED IN NUTCH-DEFAULT.XML (DB.MAX.OUTLINKS.PER.PAGE =
100 BY DEFAULT), AND FINALLY IT GOES THROUGH ALL DEFINED
URLFILTERS.
> UNFORTUNATELY IT CAN HAPPEN THAT THE LIST OF OUTLINKS HAS
MORE THAN 100 ENTRIES, SO THE LIST IS TRUNCATED AND SOME
RELEVANT LINKS COULD BE REMOVED.
> SO I'VE ADDED A FEW OPTIONS TO NUTCH-DEFAULT.XML IN
ORDER TO ENABLE/DISABLE THE EXTRACTION OF LINKS FROM
SPECIFIC HTML TAGS IN THIS PARSER (SCRIPT, IMG, FORM, LINK).
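
The truncation problem described in the issue can be illustrated with a small sketch. This is not the Nutch code path: `TaggedLink` and `selectOutlinks` are hypothetical names, and the limit of 2 stands in for `db.max.outlinks.per.page` so the effect is visible with only four links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OutlinkTruncationSketch {

    /** Hypothetical record of the tag a link was found in; not a Nutch class. */
    static final class TaggedLink {
        final String tag;
        final String url;

        TaggedLink(String tag, String url) {
            this.tag = tag;
            this.url = url;
        }
    }

    /** Skips links from ignored tags and keeps at most maxOutlinks of the rest. */
    static List<String> selectOutlinks(List<TaggedLink> links,
                                       Set<String> ignoreTags,
                                       int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        for (TaggedLink link : links) {
            if (kept.size() >= maxOutlinks) {
                break; // limit reached, remaining links are dropped
            }
            if (ignoreTags.contains(link.tag)) {
                continue; // extraction disabled for this tag
            }
            kept.add(link.url);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TaggedLink> links = Arrays.asList(
            new TaggedLink("script", "http://example.com/a.js"),
            new TaggedLink("img", "http://example.com/logo.png"),
            new TaggedLink("a", "http://example.com/page1"),
            new TaggedLink("a", "http://example.com/page2"));

        Set<String> ignoreNothing = new HashSet<>();
        Set<String> ignoreAssets = new HashSet<>(Arrays.asList("script", "img"));

        // With nothing ignored, the script and image links use up the limit.
        System.out.println(selectOutlinks(links, ignoreNothing, 2));
        // With script and img ignored, the two page links survive truncation.
        System.out.println(selectOutlinks(links, ignoreAssets, 2));
    }
}
```

With no tags ignored, the two asset links fill the limit and both page links are lost; ignoring `script` and `img` leaves room for the page links instead.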