Thread: Created: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter




Created: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 02:26:51
Illegal character replacements in ISOLatin1AccentFilter
-------------------------------------------------------

                 Key: LUCENE-1029
                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.2
            Reporter: Marko Asplund


The ISOLatin1AccentFilter class is responsible for replacing
"accented characters in the ISO Latin 1 character set
by their unaccented equivalent".

Some of the replacements performed for Scandinavian
characters (used e.g. in the Finnish, Swedish and Danish
languages etc.) are illegal. The Scandinavian characters are
different from the accented characters used e.g. in
Latin-based languages such as French, in that these characters
(ä, ö, å) represent entirely independent sounds in the language
and therefore cannot be represented with any other sound
without a change of meaning. It is therefore illegal to
replace these characters with any other character.

This means, for example, that you can't change the Finnish
word sää (weather) to saa (will have), because these are two
entirely different words with different meanings. The same
applies to the Scandinavian languages as well.

There's no connection between the sounds represented by ä
and a, ö and o, or å and a.

In addition to the three characters mentioned above, Danish
and Norwegian use other special characters such as ø and æ.
It should be checked whether the replacement is legal for
these characters.
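
The collision described in this report can be reproduced with a minimal sketch. The `fold` method below is a hypothetical stand-in, not Lucene's implementation: it assumes only the three replacements named above (ä to a, ö to o, å to a, in both cases) and groups words by their folded form to show which distinct words end up as the same token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the accent filter; NOT Lucene's code.
final class FoldingCollisionSketch {

    // Applies only the three replacements named in the issue report.
    static String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append('a'); break; // a with diaeresis
                case '\u00f6': out.append('o'); break; // o with diaeresis
                case '\u00e5': out.append('a'); break; // a with ring above
                case '\u00c4': out.append('A'); break; // upper-case variants
                case '\u00d6': out.append('O'); break;
                case '\u00c5': out.append('A'); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Groups the input words by folded form, keeping insertion order.
    // Any group with more than one distinct word is a collision.
    static Map<String, List<String>> groupByFolded(List<String> terms) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String term : terms) {
            groups.computeIfAbsent(fold(term), k -> new ArrayList<>()).add(term);
        }
        return groups;
    }
}
```

Passing the Finnish pair sää and saa to `groupByFolded` yields a single group under the key `saa`, which is the loss of distinction the report objects to.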



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 05:40:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534797 ] 

Uwe Schindler commented on LUCENE-1029:
---------------------------------------

This is true for other European languages, too. In German
there is also a difference between "ä" and "a" (they sound
different). A correct replacement in German would be to
replace "ä" by "ae" (two chars).
But I think it is not a problem. The real use of this filter
is to enable people from other countries, who do not have
these keys on their keyboard, to search in a Lucene index.
Many Americans, for example, always search for the German
last name "Müller" by typing "Muller", because they cannot
enter the umlaut. In Scandinavian languages it will be the
same: they would enter "o" instead of "ø". The accent filter
is just there to enable this. If you create an index just for
one Scandinavian country, just leave this filter out.
And in principle it is no problem to find documents that
do not match the entered keywords exactly.
The filter is similar to the Soundex filter: after a
transformation to Soundex the word looks different and no
longer has its original meaning.
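
The two behaviours contrasted in this comment can be sketched side by side. Both methods are illustrative assumptions, not Lucene API: `strip` simply drops the umlaut, as the comment describes the filter doing, and `expandGerman` applies the two-character German convention (the comment names only "ä" to "ae"; the other mappings here are the usual German equivalents, added for illustration).

```java
// Illustrative sketch only; neither method is part of Lucene.
final class UmlautSketch {

    // Drops the umlaut: a-umlaut -> a, o-umlaut -> o, u-umlaut -> u.
    static String strip(String s) {
        return s.replace('\u00e4', 'a').replace('\u00f6', 'o').replace('\u00fc', 'u')
                .replace('\u00c4', 'A').replace('\u00d6', 'O').replace('\u00dc', 'U');
    }

    // German convention: a-umlaut -> ae, o-umlaut -> oe, u-umlaut -> ue,
    // sharp s -> ss.
    static String expandGerman(String s) {
        return s.replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
                .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue")
                .replace("\u00df", "ss");
    }
}
```

With `strip`, the indexed name Müller becomes `Muller` and so matches a query typed without the umlaut. With `expandGerman` it becomes `Mueller`, which is closer to German spelling practice but would not match the query "Muller". Whichever mapping is chosen has to be applied to both the indexed text and the query.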



Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 06:19:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534800 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

I have to disagree; I think it's a problem that the filter
makes illegal character replacements.
A Soundex match is different, since by definition it's all
about non-exact or approximate matching.

In some languages accented characters may have equivalent
unaccented characters with which the accented ones may be
replaced without change or loss of meaning.
Some of the ISOLatin1AccentFilter replacements are legal
while others are illegal. The illegal ones should be fixed.




Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 06:37:51
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534804 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

I think Uwe nailed this one. Stripping accents in general is
just not "legal", but many times it is desirable.
This filter does that for you. It goes without saying that
if you strip the accent you change the meaning... likewise,
when you stem a word you create illegal words...



Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 06:47:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534810 ] 

Karl Wettin commented on LUCENE-1029:
-------------------------------------

I'm with Marko here.

If you are going to compare with stemmers, consider that
these create unique tokens that do not interfere with
semantic meanings.

With the accent filter, running the Swedish word "kön"
through the filter would create "kon". The first means
"gender" and the second "cow". That would not be acceptable.

I say this filter needs to be more configurable.
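
One way to read "more configurable" is a folder that takes a set of characters to leave untouched. The sketch below is an assumption about what such an option could look like, not an existing Lucene class: the class name, the `protecting` factory and the small default mapping table are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical configurable folder; not part of Lucene.
final class ConfigurableFoldingSketch {

    // Small illustrative default table; a real filter would cover far more.
    private static final Map<Character, String> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put('\u00e4', "a"); // a with diaeresis
        DEFAULTS.put('\u00f6', "o"); // o with diaeresis
        DEFAULTS.put('\u00e5', "a"); // a with ring above
        DEFAULTS.put('\u00e9', "e"); // e with acute
        DEFAULTS.put('\u00e8', "e"); // e with grave
    }

    private final Set<Character> protectedChars = new HashSet<>();

    private ConfigurableFoldingSketch() { }

    // Builds a folder that never replaces the given characters.
    static ConfigurableFoldingSketch protecting(char... chars) {
        ConfigurableFoldingSketch folder = new ConfigurableFoldingSketch();
        for (char c : chars) {
            folder.protectedChars.add(c);
        }
        return folder;
    }

    // Replaces every mapped character unless it is protected.
    String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String replacement = DEFAULTS.get(c);
            if (replacement == null || protectedChars.contains(c)) {
                out.append(c);
            } else {
                out.append(replacement);
            }
        }
        return out.toString();
    }
}
```

A folder built for Swedish text with `protecting` applied to å, ä and ö leaves "kön" unchanged while still folding a word such as "café" to "cafe". A folder built with no protected characters reproduces the behaviour criticised here and turns "kön" into "kon".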





Issue Comment Edited: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 06:49:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534804 ] 

markrmiller@gmail.com edited comment on LUCENE-1029 at 10/15/07 4:47 AM:
---------------------------------------------------------------

I think Uwe nailed this one. Stripping accents in general is
just not "legal", but many times it is desirable.
This filter does that for you. It goes without saying that
if you strip the accent you change the meaning... likewise,
when you stem a word you create illegal words...

P.S.

Changing this filter is not really a great option, as it
would break indexes out there that use it. I think the
better idea would be to create a new stripper that has the
alternate functionality you are thinking of: rather than
stripping accents, replace accented characters with letters
that approximate the original sound/meaning.



Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 06:58:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534814 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

> With the accent filter, running the Swedish word "kön"
> through the filter would create "kon". The first means
> "gender" and the second "cow". That would not be acceptable.

I am feeling lazy right now, but it seems to me you could
find a similar rare stemming example (e.g. something that
means something else in its stemmed form). The process is
algorithmic after all, and there are many languages with
plenty of words out there.

Regardless, it doesn't seem this filter claims it will
maintain the meaning of "kön"... rather, it will
strip the '..' off the top of the 'o'. It's a brute-force and
somewhat dangerous filter from the get-go... stripping
accents is not a valid language operation that I know of.

I'll leave it at that from my side of the argument <g>
Let the Lucene gods speak.



Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 07:04:53
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534818 ] 

Karl Wettin commented on LUCENE-1029:
-------------------------------------

>> With the accent filter, running the Swedish word "kön"
>> through the filter would create "kon". The first means
>> "gender" and the second "cow". That would not be acceptable.
>
> I am feeling lazy right now, but it seems to me you could
> find a similar rare stemming example (e.g. something that
> means something else in its stemmed form). The process is
> algorithmic after all, and there are many languages with
> plenty of words out there.

Just to point out, pretty much any short word (fewer than,
say, 6 letters or so) in Swedish containing å, ä or ö would
get a completely different meaning if you replaced the
letters.







Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 07:17:51
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534829 ] 

Marko Asplund commented on LUCENE-1029:
---------------------------------------

It's also very easy to find examples in the Finnish language
where the meaning of the word changes when you make the
character replacements done by the filter class.

Just to give you some examples:
- sää (weather) ==> saa (will have)
- pässi (goat) ==> passi (passport)
...

The filter class Javadoc says the following:

"A filter that replaces accented characters in the ISO
Latin 1 character set (ISO-8859-1) by their unaccented
equivalent. The case will not be altered."

In my opinion, changing the meaning of a word does not
qualify as an "equivalent" replacement.




Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
2007-10-15 07:29:50
    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534839 ] 

DM Smith commented on LUCENE-1029:
----------------------------------

Transliteration rules are language dependent. I suggest that
the documentation for the ISOLatin1AccentFilter be adjusted
to match its behavior, stating that it strips diacritics
from characters and does further substitutions (giving the
precise list) and that it does not do transliteration.
Further, it should give examples, like those in the comments
above, showing that such stripping may produce results that
are entirely inappropriate.

ICU4J can be used to do per-language transliteration. IIRC,
dependency on third-party code is allowed in contrib. So, it
would be appropriate for such filters to be in contrib.
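
The distinction between stripping diacritics and "further substitutions" can be illustrated with the JDK alone. This is a sketch under stated assumptions: it uses `java.text.Normalizer` rather than ICU4J or the filter's own replacement table, and the class and method names are made up for the example.

```java
import java.text.Normalizer;

// Illustrative only: generic diacritic stripping via Unicode decomposition.
// This is neither the ISOLatin1AccentFilter table nor ICU4J transliteration.
final class DiacriticStrippingSketch {

    // Decomposes each character (NFD) and removes the combining marks.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Under this approach ä, ö and å lose their marks, so sää becomes `saa` and kön becomes `kon`. The letters ø and æ have no canonical decomposition and pass through unchanged, so any filter that also replaces them is doing a separate substitution that its documentation would need to list. Neither step is language-aware, which is the gap per-language transliteration is meant to fill.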



