
Thread: Re: RSS and diacritics




Re: RSS and diacritics
United States
2007-11-27 21:52:52
>> There's a lot of software and fonts that don't have very
>> complete character sets.  Arial Unicode so far has the most
>> complete that I know of. People using a browser will have
>> to have it set to use a unicode font to see unicode
>> characters correctly.  On top of that, there's a lot of
>> software that mishandles combining diacritics (IE 6 is one
>> example, if I recall correctly) and will never display them
>> correctly.
>> 
>
>There are a few common misconceptions here.
>
>All modern web browsers are Unicode based at the core;
>older 8-bit legacy encodings are supported by transcoding
>to Unicode on the fly.
>
>This has been the case since Netscape 4 and IE 3/4.
>

Well, yes, but if the font they're using doesn't have
anything beyond the basic ascii mapping, they're not what I
would call unicode compatible ;).  In my defense  I wasn't
saying browsers have issues, but there's a lot of software
that does.    

IE 6 for a while in a default configuration did have several
bugs relating to unicode.  Heck, there's still a lot of
programming languages out there that have horrible unicode
support.  I was flabbergasted a few years ago to see how
poor the support was in Ruby, for crying out loud.

>All core operating system fonts (Windows and MacOS) are
>Unicode based, even core fonts on Windows 98.
>

Well, I'm not an expert on these things.  My main goal was
to advise someone having trouble with seeing characters
appear on the webpage to use by default a font that had the
widest implemented character set.  I chose the font that
came to mind, which is probably out of date ;).  I didn't
think that Windows 98 core fonts, once you got out of the
ascii range, were very compatible, but I don't pay much
attention to these things.  Neither do the people who set
all their default fonts to...*shudders* comic sans.

>There are no pan unicode fonts. There are too many
characters in unicode 
>to be able to have a single font support them. Fonts
have physical limits.
>

I guess I'm confused here.  How does a font have a physical
limit?  While certainly a daunting task, I would think it's
certainly possible someone could come up with a font that
has all the unicode characters that are currently speced
out.  (True, there's a lot of unassigned ones, but what's
the point of debating about that). 

>Arial Unicode MS only supports a very old version of
Unicode, and that 
>incompletely. It is useful for characters with
diacritics when those 
>characters are precomposed characters. It is not
suitable for combining 
>diacritics. It doesn't have the required mark and mkmk
OpenType features 
>for the Latin script.
>
>Combining diacritic support on the Windows platform
requires:
>
>1) an appropriate font, and
>2) an appropriate font rendering system
>
>For Windows this means:
>
>a) using Windows Vista, or
>b) using Windows XP (Service Pack 2), installing an
>appropriate font (there are a small number of fonts
>available), and enabling complex script support.


I guess I'm confused.  So all core fonts since Windows 98
are unicode fonts, just as long as you don't expect them to
do unicodish things like combining diacritics?  Then you
need a new OS?  I don't quite get what you're saying.  

>
>IE6 will display combining diacritics correctly on
Windows XP SP2 (with 
>complex script support enabled) and if you are using an
appropriate 
>font, e.g. Doulos SIL, Charis SIL, the Gentium Book
beta, 
>African/Aboriginal Sans , African/Aboriginal Serif, Code
2000, and 
>possibly the latest DejaVu fonts, etc..
>

Thanks ;).  I'm planning on poking at some of these fonts. 
I've been looking for a replacement for Arial Unicode for a
while now.  Sadly, a great deal of our patrons and our other
folks aren't going to be installing other fonts, so we're
stuck with trying to choose fonts they might have from
installing something like Word.  

>
>> Other issues like bi-directionality are ambiguous
and not clear even now.  For example, if you have Korean and
English in one document, it's not clear what layer of the
software is required to do the work necessary so each can be
read in the right direction.
>> 
>
>Korean doesn't require bidi support. I think you are
thinking of 
>vertical text layout here, not bidi support.
>

Ya caught me: I was thinking about two different issues and
gave a vertical layout example when I was thinking about a
bi-di one.  (Although, if I remember correctly, Korean is
read from top to bottom, right to left.  Not sure what that
problem is called.  Over Under Sideways Down (OUSD?).)  In
any case, directionality can be a pain.

>Also in XML a schema or DTD should define mechanisms
>for handling bidi support or should reference the ITS
>namespace. The RSS schemas/DTD do not.
>Lack of bidi support in RSS has been a long-standing issue.
>

Well, I'd argue it's not clear to me that it's necessarily
the schema that should be defining the mechanisms.  After all,
that's a lower-level interaction I'd like to see remain
constant despite changes in schema or in the absence of one.
 My editor should know how to handle it, regardless of the
XML or if I'm editing XML.  But I'm not really an expert in
these things, just trying to give what practical advice I
can. Right now in practice this seems to be an issue whether
there is a schema or not.


Thanks for some of the tips, but I'm not really sure if what
I had were misconceptions.  I may have made some
generalizations, but that's because these issues really are
too complex to address in this forum.  (I also admit to
being somewhat more sloppy in my phrasing when trying to
help someone as opposed to just musing on a concept).  

In practice when advising some people on various unicode
issues I've found myself giving the following advice:

1) Be aware all layers of software can be prone to having
different issues or configuration needs with unicode.  Make
sure you're passing along the encoding you intend and some
over-zealous piece of software isn't attempting to map what
it thinks is MARC-8 to some ancient Swahili character set.

2) Make sure you can actually view the file you're looking
at with the font you have.  A depressing number of people
have said "There's something wrong with this file" when the
reality was "My font can't display this character, so it's
showing this cute little box".

3) Try to avoid combining diacritics. 

4) Software lags several years behind changes to the unicode
standards, probably because many people are still trying to
understand the old ones ;).  See rule 3.

5) There's a lot of issues that don't seem clear.  Where
should bi-di issues be addressed?  Is fancy bred in the
heart or in the head?
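
To make point 1 concrete, here is a minimal Python sketch of what happens when one layer decodes bytes with the wrong encoding. The sample string and the helper name `looks_like_utf8` are made up for illustration; MARC-8 itself is not covered, since Python's standard library has no codec for it, so Latin-1 stands in as the "wrong guess".

```python
# Hypothetical sample; any text with a non-ASCII character would do.
original = "Brontë"

# What a well-behaved layer sends: UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A layer that wrongly assumes Latin-1 turns each UTF-8 byte
# into its own character, producing the familiar garbage.
wrong = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was actually used round-trips.
right = utf8_bytes.decode("utf-8")


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    This is only a sanity check: valid UTF-8 is strong evidence,
    but it cannot prove what encoding the sender intended.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(repr(wrong))   # the two-character mess in place of the single ë
print(repr(right))
print(looks_like_utf8(original.encode("latin-1")))
```

The check is cheap enough to run at each hand-off between layers, which is usually where the wrong guess gets made.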

And on that note, I've talked too long.  

Jon
_______________________________________________
Web4lib mailing list
Web4libwebjunction.org
http://lists.webjunction.org/web4lib/

Re: RSS and diacritics
Australia
2007-11-28 01:19:17

Jonathan Gorman wrote:

> Well, yes, but if the font they're using doesn't have
anything beyond the basic ascii mapping, they're not what I
would call unicode compatible ;).  In my defense  I wasn't
saying browsers have issues, but there's a lot of software
that does.    

Core windows fonts were designed to support WGL 4.0 which
was seen as a 
necessary subset of Unicode to meet the needs of European
languages. 
Documentation should be available on the Microsoft
Typography site.

The reality is that fonts are developed for specific sets of
languages and scripts. From the point of view of typographic
design it isn't desirable to mix scripts. It is a fine art to
design glyphs for one script that can harmonise with another
script without distorting the scripts.

I'd draw a distinction between:

1) software that is not internationalised, or that the
developers made a mess of;
2) software based on the Windows 95 internationalization
model (i.e. remapping Unicode to Windows code pages),
although Microsoft itself moved away from this model with
web browsing technology. Only languages directly supported
by code pages are supported by this model;
3) software based on the Windows 2000 internationalization
model, which is Unicode at the core and maps Unicode to code
pages.

As you indicate, there is a lot of badly written code out
there. But the approaches to handling Unicode have been
around for many years. We've gone through various iterations
of operating systems and applications that are Unicode
based. And there are only a limited number of languages or
scripts that are problematic or difficult now.

Personally, I'm waiting for Mon, S'gaw Karen, Cham and
Viet-Tai support, and currently testing a Mon and S'gaw
Karen Unicode 5.1 beta solution.

Most languages are so straightforward and easy these days,
esp. on the web.

> IE 6 for a while in a default configuration did have
> several bugs relating to unicode.  Heck, there's still a lot
> of programming languages out there that have horrible
> unicode support.  I was flabbergasted a few years ago to see
> how poor the support was in Ruby, for crying out loud.

PHP 4 and 5 are even worse, and I will not discuss the
warped Perl character model.

> 
> Well, I'm not an expert on these things.  My main goal
was to advise someone having trouble with seeing characters
appear on the webpage to use by default a font that had the
widest implemented character set.  I chose the font that
came to mind, which is probably out of date ;).  I didn't
think that Windows 98 core fonts, once you got out of the
ascii range, were very compatible, but I don't pay much
attention to these things.  Neither do the people who set
all their default fonts to...*shudders* comic sans.
> 

The approach we take is the opposite: we follow W3C
internationalisation best practice and tag language and
language change (also required for WCAG 1.0 compliance). We
also use language-specific styling, so that different
languages use the most appropriate fonts for that
language/script.

You are also forgetting the font linking technologies built
into the web browser and the font rendering system. The only
time there are problems is when the web page or site style
sheet actually uses the wrong fonts or the fonts don't
support the necessary languages.

That is one of the reasons why, when we use Hotmail, Yahoo or
Gmail, we use Firefox with Stylish and override the fonts
used on those sites, so we can selectively use more
appropriate fonts to display certain languages, esp. African
languages.

Also, font coverage on each version of Windows is different
because each version of Windows supports a different range
of languages. Vista has Khmer and Lao fonts by default, but
you will not find any shipped on older versions of Windows.

It comes down to the web developer and programmers doing a
good web internationalisation job.

>> There are no pan unicode fonts. There are too many
characters in unicode 
>> to be able to have a single font support them.
Fonts have physical limits.
>>
> 
> I guess I'm confused here.  How does a font have a
physical limit?  While certainly a daunting task, I would
think it's certainly possible someone could come up with a
font that has all the unicode characters that are currently
speced out.  (True, there's a lot of unassigned ones, but
what's the point of debating about that). 

There are a limited number of glyphs that can be contained
in a TrueType 
font, i.e. 65536 glyphs. So to support all existing CJK
ideographs, 
you'd need two fonts.
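
A rough back-of-the-envelope check of that limit, as a Python sketch. The 65536 figure is the one quoted above; the character count comes from whatever Unicode version is bundled with the Python interpreter (newer than anything discussed in this thread), and assuming one glyph per character is only a lower bound, since many scripts need more. The function names are made up for illustration.

```python
import sys
import unicodedata

# Glyph limit per font, as quoted in the post above.
MAX_GLYPHS = 2 ** 16


def count_assigned_characters() -> int:
    """Count code points that are assigned characters.

    Unassigned (Cn), private use (Co) and surrogate (Cs) code
    points are skipped. The result depends on the Unicode version
    that ships with this Python build.
    """
    skipped = {"Cn", "Co", "Cs"}
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) not in skipped
    )


def fonts_needed(glyph_count: int, limit: int = MAX_GLYPHS) -> int:
    """Minimum number of fonts needed to hold glyph_count glyphs."""
    return -(-glyph_count // limit)  # ceiling division


assigned = count_assigned_characters()
print(unicodedata.unidata_version, assigned, fonts_needed(assigned))
```

Even at one glyph per character the count does not fit in a single font, before any ligatures, conjuncts or contextual forms are added.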

Even a script like Devanagari, which has a limited number of
characters, requires thousands of additional glyphs to
support necessary ligatures and conjuncts.

An Urdu Nastaliq OpenType font could max out the available
glyphs and processing instructions just for one language.

The current trend on windows is to make script specific
fonts which may 
not even include the basic Latin characters, and different
UI fonts for 
different scripts.

>> Arial Unicode MS only supports a very old version
of Unicode, and that 
>> incompletely. It is useful for characters with
diacritics when those 
>> characters are precomposed characters. It is not
suitable for combining 
>> diacritics. It doesn't have the required mark and
mkmk OpenType features 
>> for the Latin script.
>>
>> Combining diacritic support on the Windows platform
requires:
>>
>> 1) an appropriate font, and
>> 2) an appropriate font rendering system
>>
>> For Windows this means:
>>
>> a) using Windows Vista, or
>> b) using Windows XP (Service Pack 2) and installing
an appropriate font. 
>> There are a small number of fonts available and
enabling complex script 
>> support.
> 
> 
> I guess I'm confused.  So all core fonts since Windows
98 are unicode fonts, just as long as you don't expect them
to do unicodish things like combining diacritics?  Then you
need a new OS?  I don't quite get what you're saying.  
> 

Doing combining diacritics needs not just a font, but also a
font rendering system that knows how to use the information
in the font. On Windows this is Uniscribe. Different versions
of Windows have different versions of Uniscribe. Over time
Microsoft adds more support. The first versions of Uniscribe
to ship with combining diacritic support for Latin and
Cyrillic script were the versions in Office 2003 (local
copy, not system copy) and Windows XP Service Pack 2. But no
fonts were shipped; you had to use third-party fonts. And for
many languages this is necessary.

You also have to enable complex script support for Windows
XP, since it doesn't use Uniscribe by default unless you've
enabled the RTL and complex script support.

I.e. Latin script needs to be treated as a complex script
rather than as a non-complex script.

Vista was the first version of Windows to ship with
appropriate fonts. Currently these are the new versions of
the old core fonts and the new UI font.

The only combining diacritics that work on older versions of
Windows are those that belong to the repertoire Microsoft
uses for Vietnamese support, and they will only work with
fonts that have Windows 1258 support built in. But these use
GSUB rather than GPOS tables in the fonts, if memory serves
me correctly.



>> IE6 will display combining diacritics correctly on
Windows XP SP2 (with 
>> complex script support enabled) and if you are
using an appropriate 
>> font, e.g. Doulos SIL, Charis SIL, the Gentium Book
beta, 
>> African/Aboriginal Sans , African/Aboriginal Serif,
Code 2000, and 
>> possibly the latest DejaVu fonts, etc..
>>
> 
> Thanks ;).  I'm planning on poking at some of these
fonts.  I've been looking for a replacement for Arial
Unicode for a while now.  Sadly, a great deal of our patrons
and our other folks aren't going to be installing other
fonts, so we're stuck with trying to choose fonts they might
have from installing something like Word.  

The best choice is to choose fonts that ship with
international English versions of Windows, and to avoid the
one-font-fits-all approach.


> In practice when advising some people on various
unicode issues I've found myself giving the following
advice:
> 
> 1) Be aware all layers of software can be prone to
having different issues or configuration needs with unicode.
 Make sure you're passing along the encoding you intend and
some over-zealous piece of software isn't attempting to map
what it thinks is MARC-8 to some ancient Swahili character
set.

I'd add that you also need to be very specific about which
parts of Unicode you need. I find that most vendors claim to
support Unicode, and they do, but usually a very small
subset. Unicode doesn't require supporting everything. So
it's important to know what you need and specify it.

> 2) Make sure you can actually view the file you're
looking at with the font you have.  Depressing number of
people have said "There's something wrong with this
file" when reality was "My font can't display this
character, so it's showing this cute little box".

yep

> 3) Try to avoid combining diacritics. 
> 

There is nothing wrong with combining diacritics. For most
languages libraries need, combining diacritics aren't
necessary, although there are lots of languages where there
is no choice: the language needs combining diacritics.

The main problem is that the W3C has always recommended that
Unicode web pages use Unicode Normalization Form C (NFC),
but vendors don't bother normalising data before display. If
they used NFC then you wouldn't need to worry about most
combining diacritics at the web end.
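
As a small illustration of what normalising to NFC does, here is a sketch using Python's standard `unicodedata` module. The wrapper name `to_nfc` and the sample strings are made up for the example; nothing here describes what any particular vendor does.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalise text to Unicode Normalization Form C before output."""
    return unicodedata.normalize("NFC", text)


# "e" followed by COMBINING ACUTE ACCENT (two code points).
decomposed = "e\u0301"
# LATIN SMALL LETTER E WITH ACUTE (one precomposed code point).
precomposed = "\u00e9"

# NFC folds the combining sequence into the precomposed character,
# so the renderer never has to position the accent itself.
print(to_nfc(decomposed) == precomposed)

# Where no precomposed character exists, NFC leaves the combining
# mark in place: "q" + COMBINING TILDE stays as two code points.
no_precomposed_form = "q\u0303"
print(len(to_nfc(no_precomposed_form)))
```

The second case is why NFC takes care of most, but not all, combining diacritics: languages that need sequences with no precomposed form still depend on the font and the rendering system.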

Still a problem at the cataloguing stage, but if cataloguing
tools are 
based on the win2000/XP internationalization model the
clients will work 
well.

> 4) Software lags several years behind changes to the
unicode standards, probably because many people are still
trying to understand the old ones ;).  See rule 3.

Maybe that's why I'm leaning more towards Linux/GNOME with a
Graphite-enabled version of Pango. Does my heart good to see
Myanmar displayed the way it should.

> 5) There's a lot of issues that don't seem clear. 
Where should bi-di issues be addressed?  Is fancy bred in
the heart or in the head?

No, that would take way too much time to dissect.

-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com

Ph: +613-8664-7430                    Fax:+613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/
http://www.vicnet.net.au/
http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/

Re: RSS and diacritics
United States
2007-11-29 07:56:02
Jonathan Gorman wrote:
>> all modern web browsers are Unicode based at the
core,...
>>
>>     
>
> Well, yes, but if the font they're using doesn't have
anything beyond the basic ascii mapping, they're not what I
would call unicode compatible...

The more adept browsers out there figured this out quite a
while ago.  
If the font they're using doesn't have a glyph for the
character 
requested, they pull the correct glyph from a font that does
have it.  
Awkwardly, there's a less adept browser that fails to do
this, that has 
about 80% market share...

CSS2 requires that browsers work their way down the list of
specified 
fonts to find the right glyph, not just find a matching font
name.  
IIRC, Gecko-based browsers and Opera go beyond that to find
any system 
font with the right glyph.
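
A minimal Python sketch of the lookup order described above. The function name `pick_font` and the coverage table are hypothetical, made up for the illustration; real browsers query the fonts themselves, and this per-character view ignores shaping and combining sequences entirely.

```python
from typing import Dict, List, Optional, Set


def pick_font(
    char: str,
    requested: List[str],
    installed: Dict[str, Set[str]],
) -> Optional[str]:
    """Pick a font for one character.

    First walk the author's font list in order and take the first
    font that actually has a glyph (the CSS2 behaviour described
    above). If none does, fall back to any installed font that
    covers the character. Return None if nothing covers it.
    """
    for name in requested:
        if char in installed.get(name, set()):
            return name
    for name, coverage in installed.items():
        if char in coverage:
            return name
    return None


# Hypothetical coverage data, not taken from the real fonts.
installed_fonts = {
    "Arial": set("abc"),
    "Doulos SIL": set("abc\u0259"),
    "Code2000": set("\u0259\u5b57"),
}
author_list = ["Arial", "Doulos SIL"]

for ch in ["a", "\u0259", "\u5b57", "\u2603"]:
    print(repr(ch), pick_font(ch, author_list, installed_fonts))
```

A character covered by nothing returns None, which is the "cute little box" case.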


-- 
Thomas Dowling
tdowlingohiolink.edu


Re: RSS and diacritics
United States
2007-11-29 08:42:05
On Thu, 29 Nov 2007, Thomas Dowling wrote:

> The more adept browsers out there figured this out quite a
> while ago.  If the font they're using doesn't have a glyph
> for the character requested, they pull the correct glyph
> from a font that does have it.  Awkwardly, there's a less
> adept browser that fails to do this, that has about 80%
> market share...
>
> CSS2 requires that browsers work their way down the list of
> specified fonts to find the right glyph, not just find a
> matching font name.  IIRC, Gecko-based browsers and Opera go
> beyond that to find any system font with the right glyph.

As an aside, that is precisely the approach taken by Anzio,
our terminal 
emulation package, and Print Wizard, our printing utility.
These programs 
also take many steps to handle combining diacritics well,
including 
raising the "above" diacritics where necessary to
avoid collision with the 
base character.

My perception of the most common issues in regards to
library systems 
displaying (and printing) diacritics and non-Latin
characters:

1) Very few fonts have the combining double tilde and
combining double 
ligature marks, used mostly with transliterated Russian.

2) Software does not correctly combine combining diacritics.


3) Fonts are inconsistent in the way they specify the
X-location of 
combining diacritics.

4) Library software I have worked with does not give the
browsers 
information about the language contained in a particular
section of text. 
Thus the browser can not take advantage of the user's
language-specific 
font preferences. This is especially a problem in rendering
Han 
characters, which could be part of a Japanese, Korean,
Simplified Chinese, 
or Traditional Chinese title, for instance. With IE, this
seems to force 
the user to use one super-font, which inevitably has
shortcomings.
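
For point 4, a minimal Python sketch of how server-side code might tag a run of text with its language so the browser can apply the user's language-specific font preferences. The helper name `tag_language` is hypothetical, and where the language code comes from (for example, a field in the catalogue record) is an assumption that depends on the system.

```python
from html import escape


def tag_language(text: str, lang: str) -> str:
    """Wrap text in a span carrying an HTML lang attribute.

    Both the text and the language code are escaped so that the
    result is safe to drop into a page.
    """
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


# The same Han characters can be tagged differently depending on
# the language of the title they appear in.
print(tag_language("\u6771\u4eac", "ja"))
print(tag_language("\u6771\u4eac", "zh-Hant"))
print(tag_language("Smith & Jones", "en"))
```

With the language stated in the markup, the choice between Japanese, Korean, Simplified Chinese and Traditional Chinese fonts no longer has to be guessed from the characters alone.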

Finally, Andrew Cunningham mentioned Font Linking. According
to MS's 
documentation, this should make it possible to define a
large virtual font 
by linking together multiple fonts, without physically
combining the 
files. So theoretically I could create a font with the
missing ligature 
marks (see 1 above), and link it to Arial Unicode, for
instance. However, 
I have never succeeded in this in regards to IE. Has anyone
succeeded in 
doing this?

Regards,
....Bob Rasmussen,   President,   Rasmussen Software, Inc.

personal e-mail: rasanzio.com
 company e-mail: rsianzio.com
          voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
            fax: (US) 503-624-0760
            web: http://www.anzio.com

Re: RSS and diacritics
Australia
2007-11-29 15:28:05

Thomas Dowling wrote:
> Jonathan Gorman wrote:
.
> 
> CSS2 requires that browsers work their way down the
list of specified 
> fonts to find the right glyph, not just find a matching
font name.  
> IIRC, Gecko-based browsers and Opera go beyond that to
find any system 
> font with the right glyph.
> 


Not that simple. When using combining diacritics you need to
treat Latin script as a complex script.

Chopping and changing fonts is more likely to break complex
rendering.

And such an approach assumes that each codepoint is
represented by a 
single glyph. The reality in some OpenType fonts is that
each codepoint 
may have multiple glyphs, one of which is a default.

And all this is irrelevant. If the web developer wrote the
page properly, then appropriate fonts would be referenced
and, if necessary, help or support files would point to the
non-core fonts required.

The ransom note effect in gecko browsers shouldn't be
necessary.


just my two cents worth, although that's no longer legal
tender here ;)


As far as I'm concerned we're talking about poor web
internationalization and poor web design practice. The weak
point has been and remains at the vendor/server end, not the
web client end.


Andrew

