Thread: Re: RSS and diacritics




Re: RSS and diacritics
United States
2007-11-27 14:34:37

---- Original message ----
>Date: Tue, 27 Nov 2007 14:56:56 -0500
>From: Bob Duncan <duncanrlafayette.edu>  
>Subject: [Web4lib] RSS and diacritics  
>To: web4libwebjunction.org
>
>
>Greetings,
>
>I'm getting ready to offer RSS feeds for our library's recent
>acquisitions lists and have run into a little snag: characters with
>diacritics.  I understand why I can't use HTML character entity
>references and expect all feed readers to play nicely, so I tried
>encoding the ampersand in the HTML entity reference (a suggested fix
>that I can no longer document).  While this works great for some feed
>readers, other readers and the two major browsers display the raw
>code instead of the character with diacritical mark.
>
>Other than displaying plain letters without diacritics, is there a
>way to code feeds so that all (or at least most) feed readers will
>display the character with the mark?  (I'd like to be able to do this
>in item titles and descriptions.)
>
>Thanks,
>

I guess I'm a little confused.  This could be any of several
problems, and there's a lot more we need to know.  Where does
the data with diacritics come from?  What encoding is it in?
Are you sure the data isn't being converted or corrupted when
you query the source?

RSS feeds are XML.  If you're pulling Unicode data and putting
it directly into the RSS feed, and the encoding the feed
declares matches the bytes you actually write, you shouldn't
have an issue.  The diacritics will be there.
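
To make that concrete, here is a minimal sketch in Python (3.8 or later), assuming the titles arrive as text that may still contain named HTML entities.  The function name build_feed and the sample data are made up for illustration, and the feed is cut down to the bare elements (a real channel also needs link and description).  It turns the entities into real characters and lets the XML library write them out as UTF-8 with a matching declaration:

```python
import html
import xml.etree.ElementTree as ET


def build_feed(title, items):
    """Build a stripped-down RSS 2.0 document and return it as UTF-8 bytes.

    `items` is a list of (title, description) pairs. Named HTML entities
    such as &eacute; are converted to real characters before serializing.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = html.unescape(title)
    for item_title, item_desc in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = html.unescape(item_title)
        ET.SubElement(item, "description").text = html.unescape(item_desc)
    # xml_declaration=True writes <?xml ... encoding='utf-8'?> so the
    # declared encoding matches the bytes that are produced.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


# Hypothetical sample data: one entity-encoded title, one literal character.
feed = build_feed(
    "Recent acquisitions",
    [("Les Mis&eacute;rables", "Victor Hugo; nouvelle \u00e9dition")],
)
print(feed.decode("utf-8"))
```

The point is only that the character itself goes into the feed, not an HTML entity and not a doubly escaped ampersand.  Whether a given reader then draws it properly is a separate question.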

That being said, Unicode isn't very well supported as of
yet.  There are a lot of software and fonts that don't have
very complete character sets.  Arial Unicode so far has the
most complete that I know of.  People using a browser will
have to have it set to use a Unicode font to see Unicode
characters correctly.  On top of that, there's a lot of
software that mishandles combining diacritics (IE 6 is one
example, if I recall correctly) and will never display them
correctly.

Other issues like bi-directionality are ambiguous and not
clear even now.  For example, if you have Korean and English
in one document, it's not clear what layer of the software
is required to do the work necessary so each can be read in
the right direction.

Unicode issues can run through several layers of software,
including the server-side software commonly used for
generating things like RSS feeds.  Often Unicode support is
feasible, but it has to be done deliberately, and frequently
it isn't.

Unicode issues can be tricky, but you should be able to
trace the data through the system and ensure that it's
still Unicode, in the encoding you expect, at every step.
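
One rough way to do that tracing, sketched in Python: capture the raw bytes at each stage (database export, script output, the feed as served) and check that they decode as UTF-8.  The helper name check_utf8 and the two samples are hypothetical stand-ins for whatever your own system produces.

```python
def check_utf8(label, raw):
    """Report whether the bytes captured at one stage are valid UTF-8.

    Returns the decoded text, or None if decoding fails.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print(f"{label}: not valid UTF-8 ({err.reason} at byte {err.start})")
        return None
    # List the non-ASCII code points so they can be compared by eye.
    non_ascii = [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127]
    print(f"{label}: ok, non-ASCII code points: {non_ascii}")
    return text


# Hypothetical samples standing in for bytes captured at two stages.
from_catalog = "Dvo\u0159\u00e1k".encode("utf-8")   # well-formed UTF-8
from_export = "caf\u00e9".encode("latin-1")          # legacy 8-bit bytes

check_utf8("catalog", from_catalog)
check_utf8("export", from_export)
```

A clean decode only shows the bytes are well-formed.  Text that was already mangled and then re-encoded can still pass, so it is worth comparing the reported code points against what you expect to see.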

Of course, if the source data isn't in Unicode to begin with,
that's another issue.

Jon Gorman   
_______________________________________________
Web4lib mailing list
Web4libwebjunction.org
http://lists.webjunction.org/web4lib/

Re: RSS and diacritics
Australia
2007-11-27 18:09:30

Jonathan Gorman wrote:
> There's a lot of software and fonts that don't have very complete
> character sets.  Arial Unicode so far has the most complete that I
> know of. People using a browser will have to have it set to use a
> unicode font to see unicode characters correctly.  On top of that,
> there's a lot of software that mishandles combining diacritics (IE 6
> is one example, if I recall correctly) and will never display them
> correctly.
> 

There are a few common misconceptions here.

All modern web browsers are Unicode based at the core; older
8-bit legacy encodings are supported by transcoding to Unicode
on the fly. This has been the case since Netscape 4 and IE 3/4.

All core operating system fonts on Windows and Mac OS are
Unicode based, even the core fonts on Windows 98.

There are no pan-Unicode fonts. There are too many characters
in Unicode for a single font to support them all. Fonts have
physical limits (a TrueType or OpenType font can hold at most
65,535 glyphs).

Arial Unicode MS only supports a very old version of
Unicode, and that 
incompletely. It is useful for characters with diacritics
when those 
characters are precomposed characters. It is not suitable
for combining 
diacritics. It doesn't have the required mark and mkmk
OpenType features 
for the Latin script.
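
Since precomposed characters are the safer case for fonts like this, one option on the feed-generating side is to normalize text to NFC before writing it out. A minimal sketch in Python, where the helper names to_precomposed and leftover_marks are only illustrative:

```python
import unicodedata


def to_precomposed(text):
    """Normalize to NFC so base letter + combining mark pairs become a
    single precomposed character wherever Unicode defines one."""
    return unicodedata.normalize("NFC", text)


def leftover_marks(text):
    """Return the combining marks that remain after NFC normalization,
    i.e. the ones that will still need font and renderer support."""
    return [ch for ch in to_precomposed(text) if unicodedata.combining(ch)]


decomposed = "Cafe\u0301"            # 'e' followed by COMBINING ACUTE ACCENT
print(len(decomposed), len(to_precomposed(decomposed)))
print(leftover_marks("q\u0301"))     # no precomposed q-acute exists
```

This only helps where a precomposed form exists. Any sequence listed by leftover_marks still depends on the font and rendering requirements described in this message.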

Combining diacritic support on the Windows platform
requires:

1) an appropriate font, and
2) an appropriate font rendering system

For Windows this means:

a) using Windows Vista, or
b) using Windows XP (Service Pack 2), installing an appropriate
font (a small number of fonts are available), and enabling
complex script support.

IE6 will display combining diacritics correctly on Windows XP
SP2 (with complex script support enabled) if you are using an
appropriate font, e.g. Doulos SIL, Charis SIL, the Gentium Book
beta, African/Aboriginal Sans, African/Aboriginal Serif,
Code2000, and possibly the latest DejaVu fonts.


> Other issues like bi-directionality are ambiguous and not clear even
> now.  For example, if you have Korean and English in one document,
> it's not clear what layer of the software is required to do the work
> necessary so each can be read in the right direction.
> 

Korean doesn't require bidi support. I think you are
thinking of 
vertical text layout here, not bidi support.
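
You can check this against the Unicode character database. A small Python sketch (the function name needs_bidi is just illustrative) looks up each character's bidirectional class and reports whether any of them is strongly right-to-left:

```python
import unicodedata


def needs_bidi(text):
    """Return True if any character has a strong right-to-left bidi
    class: "R" (e.g. Hebrew) or "AL" (Arabic letters)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


korean = "\ud55c\uad6d\uc5b4 and English"            # Hangul syllables
hebrew = "\u05e2\u05d1\u05e8\u05d9\u05ea and English"  # Hebrew letters

print(needs_bidi(korean))
print(needs_bidi(hebrew))
```

Hangul syllables are class "L" (left-to-right), the same as Latin letters, so mixed Korean and English text never triggers the bidi algorithm. Mixed Hebrew or Arabic and English does.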

Also, in XML a schema or DTD should define mechanisms for
handling bidi support, or should reference the ITS namespace.
The RSS schemas/DTDs do not. Lack of bidi support in RSS has
been a long-standing issue.


-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com

Ph: +613-8664-7430                    Fax:+613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/
http://www.vicnet.net.au/
http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/
