
Thread: Commented: (NUTCH-25) needs 'character encoding' detector




2007-09-29 23:18:51
    [ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531290 ]

Hudson commented on NUTCH-25:
-----------------------------

Integrated in Nutch-Nightly #222 (See
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/222/])

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java,
>                      EncodingDetector_additive.java, NUTCH-25.patch,
>                      NUTCH-25_draft.patch, NUTCH-25_v2.patch,
>                      NUTCH-25_v3.patch, NUTCH-25_v4.patch, patch
>
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> This is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> (Content-Type) header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the case
> of XML, we have to use similar but different
> 'parsing'), in the wild there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
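
As a rough illustration of the layered approach described in the quoted
report (HTTP Content-Type header first, then the HTML meta tag, then a
content-based guess for unlabelled documents), here is a minimal Java
sketch. The class name CharsetSniffer, its methods, the 1024-byte scan
window, and the windows-1252 fallback are all assumptions made for this
example. It is not the Mozilla detector and not the code in the attached
patches; the final step is only a UTF-8 validity check standing in for a
real statistical detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch. */
class CharsetSniffer {

    // charset=... inside a Content-Type header value
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // charset=... inside a <meta ...> tag (both the http-equiv and HTML5 forms)
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first captured charset label, or null if the pattern does not match. */
    static String find(Pattern pattern, String text) {
        if (text == null) return null;
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** True if the bytes decode as strict UTF-8 (pure ASCII also passes). */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses a charset label: header, then meta tag, then content heuristics. */
    static String guess(String contentTypeHeader, byte[] body) {
        // 1. Trust an explicit label in the HTTP header.
        String fromHeader = find(HEADER_CHARSET, contentTypeHeader);
        if (fromHeader != null) return fromHeader;

        // 2. Scan the start of the body for a meta tag. ISO-8859-1 maps every
        //    byte to one char, so ASCII markup survives whatever the real encoding is.
        int window = Math.min(body.length, 1024); // assumed scan window
        String head = new String(body, 0, window, StandardCharsets.ISO_8859_1);
        String fromMeta = find(META_CHARSET, head);
        if (fromMeta != null) return fromMeta;

        // 3. Unlabelled document: accept UTF-8 if the bytes are valid UTF-8,
        //    otherwise fall back to an assumed default.
        return isValidUtf8(body) ? "UTF-8" : "windows-1252";
    }

    public static void main(String[] args) {
        System.out.println(guess("text/html; charset=ISO-8859-1", new byte[0]));
    }
}
```

A real detector would replace step 3 with byte-frequency or state-machine
analysis per candidate encoding; the ordering of the three steps is the
point of the sketch.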

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

