[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-25?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:ALL-TABPANEL ]
DOĞACAN GÜNEY UPDATED NUTCH-25:
-------------------------------
FIX VERSION/S: 1.0.0
ASSIGNEE: DOĞACAN GÜNEY
PRIORITY: MAJOR (WAS: TRIVIAL)
ISSUE TYPE: NEW FEATURE (WAS: WISH)
THIS SHOULD BE SOMETHING THAT WE FIX BEFORE 1.0.
> NEEDS 'CHARACTER ENCODING' DETECTOR
> -----------------------------------
>
> KEY: NUTCH-25
> URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-25
> PROJECT: NUTCH
> ISSUE TYPE: NEW FEATURE
> REPORTER: STEFAN GROSCHUPF
> ASSIGNEE: DOĞACAN GÜNEY
> FIX FOR: 1.0.0
>
> ATTACHMENTS: NUTCH-25.PATCH, NUTCH-25_DRAFT.PATCH
>
>
> TRANSFERRED FROM:
> HTTP://SOURCEFORGE.NET/TRACKER/INDEX.PHP?FUNC=DETAIL&AID=995730&GROUP_ID=59548&ATID=491356
> SUBMITTED BY:
> JUNGSHIK SHIN
> THIS IS A FOLLOW-UP TO BUG 993380 (FIGURE OUT 'CHARSET'
> FROM THE META TAG).
> ALTHOUGH WE CAN COVER A LOT OF GROUND USING THE
> 'CONTENT-TYPE' FIELD IN THE HTTP HEADER AND THE
> CORRESPONDING META TAG IN HTML DOCUMENTS (XML NEEDS
> SIMILAR BUT SEPARATE PARSING), THERE ARE MANY DOCUMENTS
> IN THE WILD WITHOUT ANY INFORMATION ABOUT THE CHARACTER
> ENCODING USED. BROWSERS LIKE MOZILLA AND SEARCH ENGINES
> LIKE GOOGLE USE CHARACTER ENCODING DETECTORS TO DEAL
> WITH THESE 'UNLABELLED' DOCUMENTS.
> MOZILLA'S CHARACTER ENCODING DETECTOR IS GPL/MPL'D AND
> WE MIGHT BE ABLE TO PORT IT TO JAVA. UNFORTUNATELY,
> IT'S NOT FOOL-PROOF. HOWEVER, COMBINED WITH OTHER
> HEURISTICS USED BY MOZILLA AND ELSEWHERE, IT SHOULD BE
> POSSIBLE TO ACHIEVE A HIGH DETECTION RATE.
> THE FOLLOWING PAGE HAS LINKS TO SOME OTHER RELATED PAGES.
> HTTP://TRAINEDMONKEY.COM/WEEK/2004/26
> IN ADDITION TO THE CHARACTER ENCODING DETECTION, WE
> ALSO NEED TO DETECT THE LANGUAGE OF A DOCUMENT, WHICH
> IS EVEN HARDER AND SHOULD BE A SEPARATE BUG (ALTHOUGH
> IT'S RELATED).
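
The lookup order described in the issue (HTTP Content-Type header first, then the HTML meta tag, then a detector for unlabelled documents) can be sketched roughly as follows. This is only an illustrative sketch, not the attached patch and not Mozilla's detector: the class name `CharsetSniffer`, its methods, the 2048-byte scan window, and the caller-supplied fallback label are all assumptions made for the example. The "detector" step here is a deliberately crude stand-in (UTF-8 byte order mark, then a strict UTF-8 validity check), and the XML declaration is not handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
final class CharsetSniffer {
    // Matches charset=VALUE as it appears in a Content-Type header
    // or in the content attribute of an HTML meta tag.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset label declared in the text, or null if none. */
    static String declared(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Returns true if the whole byte array decodes as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Header first, then meta tag, then a crude guess for unlabelled input. */
    static String resolve(String contentTypeHeader, byte[] body, String fallback) {
        // 1. Trust the HTTP Content-Type header when it names a charset.
        String label = declared(contentTypeHeader);
        if (label != null) {
            return label;
        }
        // 2. Scan the start of the body for a meta-tag charset. ISO-8859-1
        //    maps every byte to one char, so this read cannot fail.
        //    (Assumption: any "charset=" in the first 2048 bytes counts.)
        int window = Math.min(body.length, 2048);
        label = declared(new String(body, 0, window, StandardCharsets.ISO_8859_1));
        if (label != null) {
            return label;
        }
        // 3. Unlabelled document: check for a UTF-8 byte order mark.
        if (body.length >= 3 && (body[0] & 0xFF) == 0xEF
                && (body[1] & 0xFF) == 0xBB && (body[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // 4. Otherwise accept UTF-8 only if the complete body is valid UTF-8.
        return isValidUtf8(body) ? "UTF-8" : fallback;
    }
}
```

A statistical detector of the kind the issue proposes would replace steps 3 and 4; the strict UTF-8 check is used here only because it is simple to verify, and it assumes the body was not truncated in the middle of a multi-byte character.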
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE
ONLINE.