Thread: UTF/asian charset in urls




UTF/asian charset in urls
United Kingdom
2007-09-28 03:33:56
Does anyone know what these "$map(%E7%A7%8B)" things are in URLs when Asian UTF characters are used?

I.e., typing the Japanese character "秋" into a search form on AOLserver (BookMooch) converts it to

http://bookmooch.com/m/s/$map(%E7%A7%8B)

That $map looks like an array lookup, but I've never seen any documentation about this, or how it might be converted back into UTF.  I can't find any mention of $map in the AOLserver or Tcl source code, nor online.

The other possibility is that the hex codes %E7%A7%8B are just the 3-byte representations of each character, in linear order, that need to be converted back into UTF somehow.

-john


--
AOLserver - http://www.aolserver.com/

To Remove yourself from this list, simply send an email to
<listserv@listserv.aol.com> with the
body of "SIGNOFF AOLSERVER" in the email message.
You can leave the Subject: field of your email blank.

Re: UTF/asian charset in urls
2007-10-01 02:52:24
I don't see this problem in 4.0 or 4.5.

It looks like searching for 秋 on bookmooch first goes to

http://www.bookmooch.com/search?w=%E7%A7%8B&search.x=14&search.y=13

(which looks fine - that's the correct URL encoding of the UTF-8 representation of that character)
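To illustrate the point above, here is a minimal Python sketch (not AOLserver or Tcl code) showing that %E7%A7%8B is the percent-encoded form of the three UTF-8 bytes of a single character. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

encoded = "%E7%A7%8B"

# Each %XX escape is one byte; together they are the UTF-8 bytes of one character.
raw = bytes.fromhex(encoded.replace("%", ""))
char = raw.decode("utf-8")

# The standard library performs the same conversion in one step (UTF-8 is its default).
decoded = unquote(encoded)

# Encoding the character again gives back the original escape sequence.
reencoded = quote(char)

print(len(raw), hex(ord(char)), decoded == char, reencoded)
```

So the second possibility raised in the original question is the right reading of the hex codes: they are one character, three bytes, in order.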

but that page immediately redirects to a cleaner search URL (http://bookmooch.com/m/s/...) which contains the $map(...).  Maybe it's related to the code that builds the cleaner URL?

-Hossein

On 9/28/07, John Buckman <john@magnatune.com> wrote:
Does anyone know what these "$map(%E7%A7%8B)" things are in URLs when
asian-UTF characters are used?

Ie, typing into a search form on aolserver (BookMooch) for the
japanese character "秋" converts it to

http://bookmooch.com/m/s/$map(%E7%A7%8B)

That $map looks like an array lookup, but I've never seen any
documentation about this, or how it might be re-converted back into
UTF.  I can't find any mention of $map in the aolserver or tcl source
code, nor online.

The other possibility is that the hex codes: %E7%A7%8B are just 3-byte
representations of each character, in linear order, that needs
to be reconverted back into UTF somehow.

-john



Re: UTF/asian charset in urls
United Kingdom
2007-10-02 09:46:20
On Oct 1, 2007, at 8:52 AM, Hossein Sharifi wrote:
I don't see this problem in 4.0 or 4.5.
It looks like searching for 秋 on bookmooch first goes to
http://www.bookmooch.com/search?w=%E7%A7%8B&search.x=14&search.y=13
(which looks fine - that's the correct URL encoding of the UTF-8 representation of that character)
but that page immediately redirects to a cleaner search URL (http://bookmooch.com/m/s/...) which contains the $map(...).  Maybe it's related to the code that builds the cleaner URL?

Thanks Hossein, for the insight.

You were right, my problem was caused by a bug in the ncgi URL encoding function.

ncgi builds a $map() array of character conversions, and puts $map(character) around anything that isn't a-zA-Z0-9, then a [subst -nocommand $string] around all that.

The higher-UTF characters cause problems with the ncgi function, and they emerge from it with $map() wrapped around them.  A simple regexp to remove $map() from what ncgi can't encode, and now it works perfectly.
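The regexp itself was not posted, so the following is only a guess at the shape of such a cleanup, written in Python rather than Tcl. It is applied to the URL quoted earlier in the thread. The names MAP_WRAPPER and strip_map_wrappers are hypothetical, and the sketch assumes the text left inside $map(...) is already percent-encoded, as it is in that URL.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: a literal "$map(", anything up to the next ")", then ")".
MAP_WRAPPER = re.compile(r"\$map\(([^)]*)\)")

def strip_map_wrappers(url: str) -> str:
    """Remove leftover $map(...) wrappers, keeping what was inside them."""
    return MAP_WRAPPER.sub(r"\1", url)

broken = "http://bookmooch.com/m/s/$map(%E7%A7%8B)"
fixed = strip_map_wrappers(broken)

# The last path segment now decodes to the search term that was typed.
term = unquote(fixed.rsplit("/", 1)[1])

print(fixed)
print(term)
```

A URL with no $map(...) wrapper passes through unchanged, so a cleanup of this shape can be run on every generated URL.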

I have to say that the AOLserver handling of UTF is really well done.

At Lyris, we never did quite get the UTF handling in TclHttpd done perfectly; there were still some fringe cases that caused garbling.

With AOLserver, except for this ncgi problem, and figuring out that I needed to switch to UTF-8 as the default (from the default ISO 8859-1), non-English character sets have worked perfectly.

-john
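As a side note on why the default charset matters, here is a small Python sketch (not AOLserver configuration) of what happens when UTF-8 bytes are read with a single-byte Western charset. Windows-1252 is used as an assumption, because it is the charset many clients substitute for ISO 8859-1.

```python
utf8_bytes = "\u79cb".encode("utf-8")  # the three bytes E7 A7 8B

# Read as a single-byte charset, each byte becomes its own character.
misread = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no byte was dropped along the way.
recovered = misread.encode("cp1252").decode("utf-8")

print(len(misread), recovered == "\u79cb")
```

One character turns into three unrelated ones, which is the kind of garbling that appears when the server and client disagree about the charset.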
