Thread: Re: Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter




Re: Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
United States
2007-10-16 06:23:37
I wouldn't pretend to know the truth on this matter, but you might
update the wikipedia article http://en.wikipedia.org/wiki/Diacritic if
you do, as it does not agree with your comments.

Marko Asplund (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Marko Asplund updated LUCENE-1029:
> ----------------------------------
>
>     Attachment: ISOLatin1AccentFilter-javadoc.patch
>
> I think the class javadoc is very misleading so I'm attaching a
> documentation patch.
>
> For one, the Scandinavian characters do not contain diacritical marks
> or accents. The dots in ä and ö as well as the ring in å are considered
> part of the letter, not diacritics. The class name implies that it does
> something with accents, so for this reason I would not have expected
> the class to replace the Scandinavian characters.
>
> The javadoc also says it replaces characters with their "equivalent"
> ASCII characters. There are no equivalents for the Scandinavian
> characters.
>
>
>   
>> Illegal character replacements in ISOLatin1AccentFilter
>> -------------------------------------------------------
>>
>>                 Key: LUCENE-1029
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Analysis
>>    Affects Versions: 2.2
>>            Reporter: Marko Asplund
>>         Attachments: ISOLatin1AccentFilter-javadoc.patch
>>
>>
>> The ISOLatin1AccentFilter class is responsible for replacing "accented
>> characters in the ISO Latin 1 character set by their unaccented
>> equivalent".
>> Some of the replacements performed for Scandinavian characters (used
>> e.g. in the Finnish, Swedish, Danish languages etc.) are illegal. The
>> Scandinavian characters are different from the accented characters
>> used e.g. in Latin-based languages such as French in that these
>> characters (ä, ö, å) represent entirely independent sounds in the
>> language and therefore cannot be represented with any other sound
>> without a change of meaning. It is therefore illegal to replace these
>> characters with any other character.
>> This means for example that you can't change the Finnish word sää
>> (weather) to saa (will have) because these are two entirely different
>> words with different meanings. The same applies to Scandinavian
>> languages as well.
>> There's no connection between the sounds represented by ä and a; ö and
>> o or å and a.
>> In addition to the three characters mentioned above, Danish and
>> Norwegian use other special characters such as ø and æ. It should be
>> checked if the replacement is legal for these characters.
>>     
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Updated: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
United States
2007-10-16 11:55:42
Mark Miller wrote:
> I wouldn't pretend to know the truth on this matter, but you might
> update the wikipedia article http://en.wikipedia.org/wiki/Diacritic if
> you do, as it does not agree with your comments.

Wikipedia says, "Swedish uses characters identical to a-diaeresis (ä)
and o-diaeresis (ö)".  This is a little ambiguous.  Identical how?  I
think they mean "visually identical to".  The distinction is whether
Swedish treats 'ä' as a variant of 'a' or as a completely separate
letter.  The latter is the case.

http://en.wikipedia.org/wiki/Umlaut_(diacritic) states:

   Swedish [...] treat[s] them as independent letters.

Doug
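
The distinction discussed in this thread can be illustrated with a small
sketch. It is plain Java using the JDK's java.text.Normalizer, not
Lucene's ISOLatin1AccentFilter code, and the class name, method names and
the KEEP set are hypothetical choices made for illustration. foldAll
strips every combining mark (so sää becomes saa), while foldPreserving
leaves ä, ö and å alone and only folds other accented characters.

```java
import java.text.Normalizer;

final class AccentFoldingSketch {
    // Letters treated as independent letters rather than accented variants:
    // a-umlaut, o-umlaut, a-ring in lower and upper case.
    // This set is an assumption for illustration; adjust it per language.
    private static final String KEEP = "\u00e4\u00f6\u00e5\u00c4\u00d6\u00c5";

    /** Decomposes the input (NFD) and removes all combining marks. */
    static String foldAll(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    /** Like foldAll, but leaves the letters listed in KEEP untouched. */
    static String foldPreserving(String input) {
        // Compose first so each KEEP letter is a single code point.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        int i = 0;
        while (i < composed.length()) {
            int cp = composed.codePointAt(i);
            if (KEEP.indexOf(cp) >= 0) {
                out.appendCodePoint(cp);   // independent letter: keep as is
            } else {
                out.append(foldAll(new String(Character.toChars(cp))));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String weather = "s\u00e4\u00e4";  // Finnish word for "weather"
        System.out.println(foldAll(weather));                         // saa
        System.out.println(foldPreserving(weather).equals(weather));  // true
        System.out.println(foldPreserving("caf\u00e9"));              // cafe
    }
}
```

Note that ø and æ have no canonical decomposition, so neither method
changes them; a filter that maps them to ASCII has to do so with an
explicit table.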

