[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-25?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12531290 ]
HUDSON COMMENTED ON NUTCH-25:
-----------------------------
INTEGRATED IN NUTCH-NIGHTLY #222 (SEE
[HTTP://LUCENE.ZONES.APACHE.ORG:8080/HUDSON/JOB/NUTCH-NIGHTLY/222/])
> NEEDS 'CHARACTER ENCODING' DETECTOR
> -----------------------------------
>
> KEY: NUTCH-25
> URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/NUTCH-25
> PROJECT: NUTCH
> ISSUE TYPE: NEW FEATURE
> REPORTER: STEFAN GROSCHUPF
> ASSIGNEE: DOĞACAN GÜNEY
> FIX FOR: 1.0.0
>
> ATTACHMENTS: ENCODINGDETECTOR.JAVA,
> ENCODINGDETECTOR_ADDITIVE.JAVA, NUTCH-25.PATCH,
> NUTCH-25_DRAFT.PATCH, NUTCH-25_V2.PATCH, NUTCH-25_V3.PATCH,
> NUTCH-25_V4.PATCH, PATCH
>
>
> TRANSFERRED FROM:
> HTTP://SOURCEFORGE.NET/TRACKER/INDEX.PHP?FUNC=DETAIL&AID=995730&GROUP_ID=59548&ATID=491356
> SUBMITTED BY:
> JUNGSHIK SHIN
> THIS IS A FOLLOW-UP TO BUG 993380 (FIGURE OUT 'CHARSET'
> FROM THE META TAG).
> ALTHOUGH WE CAN COVER A LOT OF GROUND USING THE
> 'CONTENT-TYPE' ('C-T') FIELD IN THE HTTP HEADER AND THE
> CORRESPONDING META TAG IN HTML DOCUMENTS (XML NEEDS
> SIMILAR BUT SEPARATE PARSING), THERE ARE MANY DOCUMENTS
> IN THE WILD WITHOUT ANY INFORMATION ABOUT THE CHARACTER
> ENCODING USED. BROWSERS LIKE MOZILLA AND SEARCH ENGINES
> LIKE GOOGLE USE CHARACTER ENCODING DETECTORS TO DEAL
> WITH THESE 'UNLABELLED' DOCUMENTS.
> MOZILLA'S CHARACTER ENCODING DETECTOR IS GPL/MPL'D AND
> WE MIGHT BE ABLE TO PORT IT TO JAVA. UNFORTUNATELY,
> IT'S NOT FOOL-PROOF. HOWEVER, COMBINED WITH OTHER
> HEURISTICS USED BY MOZILLA AND ELSEWHERE, IT SHOULD BE
> POSSIBLE TO ACHIEVE A HIGH DETECTION RATE.
> THE FOLLOWING PAGE HAS LINKS TO SOME OTHER RELATED PAGES.
> HTTP://TRAINEDMONKEY.COM/WEEK/2004/26
> IN ADDITION TO THE CHARACTER ENCODING DETECTION, WE
> ALSO NEED TO DETECT THE LANGUAGE OF A DOCUMENT, WHICH
> IS EVEN HARDER AND SHOULD BE A SEPARATE BUG (ALTHOUGH
> IT'S RELATED).
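
THE LOOKUP ORDER DESCRIBED IN THE ISSUE (HTTP 'CONTENT-TYPE' HEADER, THEN THE
META TAG, THEN A DETECTOR FOR UNLABELLED DOCUMENTS) CAN BE SKETCHED IN JAVA AS
FOLLOWS. THIS IS ONLY AN ILLUSTRATION UNDER STATED ASSUMPTIONS, NOT THE
ATTACHED PATCH AND NOT A PORT OF MOZILLA'S DETECTOR: THE CLASS 'CharsetSketch',
THE 4096-BYTE SCAN WINDOW AND THE 'windows-1252' DEFAULT ARE ALL HYPOTHETICAL,
AND THE LAST STEP IS JUST A STRICT UTF-8 VALIDITY CHECK STANDING IN FOR A REAL
STATISTICAL DETECTOR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or Mozilla's detector.
final class CharsetSketch {
    // Matches "charset=NAME" in a Content-Type value or in a META tag.
    // Simplification: any "charset=" text in the scanned window will match.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Assumed default for unlabelled documents that are not valid UTF-8.
    private static final String FALLBACK = "windows-1252";

    // Assumed limit on how much of the body is searched for a META tag.
    private static final int SCAN_WINDOW = 4096;

    static String resolve(String contentTypeHeader, byte[] body) {
        // 1. Prefer the label in the HTTP Content-Type header, if present.
        String label = find(contentTypeHeader);
        if (label != null) {
            return label;
        }

        // 2. Otherwise look for a label in the start of the document.
        //    ISO-8859-1 maps every byte to a char, so decoding cannot fail.
        int n = Math.min(body.length, SCAN_WINDOW);
        label = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        if (label != null) {
            return label;
        }

        // 3. Unlabelled document: crude heuristic instead of a real detector.
        return isStrictUtf8(body) ? "utf-8" : FALLBACK;
    }

    // Returns the lower-cased charset label found in text, or null.
    private static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    // True if the bytes decode as UTF-8 with no malformed sequences.
    private static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A CALL SUCH AS 'CharsetSketch.resolve(headerValue, pageBytes)' RETURNS A
LOWER-CASED LABEL. A REAL DETECTOR WOULD REPLACE STEP 3 AND WOULD ALSO NEED TO
CHECK THAT A DECLARED LABEL ACTUALLY FITS THE BYTES.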
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE
ONLINE.