Thread: Re: Unicode String in a GeoTiff file




Re: Unicode String in a GeoTiff file
2007-10-18 09:49:17
On Mon, 2007-05-14 at 17:09 -0400, Ken Garner wrote:

>[...]
> I am creating a GeoTiff file for a customer who requires that Unicode
> text strings be stored in the file.  The text strings consist of
> Japanese Kanji characters.  Therefore, an ASCII string will not
> suffice.
> 

Actually, an ASCII string *will* suffice. The Unicode UTF-8 and UTF-7
encoding standards were designed to mesh cleanly with existing ASCII
data streams and networks, so that the full range of Unicode code
points (more than a million, up to U+10FFFF) will survive a round-trip
through the internet. Most XML, for example, specifies a UTF-8 encoding.

Here is a link to a general discussion of the various UTF Unicode
encodings, the BOM, and the advantages of each:

http://unicode.org/faq/utf_bom.html

This is really a more general issue about TIFF files, since all the
information in a GeoTIFF file relies on the TIFF data encoding
standard.

The "philosophy of TIFF" is, when possible, make your data as obvious
and accessible as possible, even if the recipient may know nothing
about your implementation of data encoding, and only has the TIFF
format spec on hand. We followed this philosophy as much as we could
in designing GeoTIFF.

At any rate, here is my own take on the issue of Unicode in TIFF:

UTF-8 encoding has the enormous advantage over "double-byte" UTF-16 or
raw UTF-32 encodings in that the encoding of standard low-ASCII
(seven-bit) looks identical to standard ASCII. UTF-8 encodings of
non-ASCII data show up as variable-length byte substrings, which are
uniquely distinguishable from the other ASCII data in which they are
embedded. UTF-8 does not require the use of a NULL (0) byte, which is
always troublesome for TIFF data readers, even though the spec allows
them as delimiters.
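
The properties described above can be checked with a short Python
sketch. The sample strings ("GeoTIFF" and the Kanji for Tokyo) are
assumptions chosen for illustration and do not come from the thread;
nothing here touches an actual TIFF file.

```python
# Illustrative sample strings (assumed, not from the original message).
ascii_text = "GeoTIFF"
kanji_text = "東京"  # "Tokyo"

ascii_utf8 = ascii_text.encode("utf-8")
kanji_utf8 = kanji_text.encode("utf-8")
mixed_utf8 = (ascii_text + " " + kanji_text).encode("utf-8")

# Seven-bit ASCII text is byte-for-byte identical when encoded as UTF-8.
assert ascii_utf8 == ascii_text.encode("ascii")

# Every byte of a non-ASCII character has its high bit set, so it can
# never be mistaken for one of the surrounding ASCII bytes.
assert all(b >= 0x80 for b in kanji_utf8)

# UTF-8 text never contains a zero byte (unless U+0000 itself is encoded).
assert 0 not in mixed_utf8

# By contrast, UTF-16 puts zero bytes into plain ASCII text.
ascii_utf16 = ascii_text.encode("utf-16-le")
assert 0 in ascii_utf16

# Each of the two Kanji characters takes three bytes in UTF-8.
print(len(kanji_utf8))      # 6
print(kanji_utf8.hex(" "))  # e6 9d b1 e4 ba ac
```
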

About the only thing that may need discussion is whether or not to
prepend the Byte Order Mark (BOM) at the beginning of the encoded
string. This marker, which has been used as the "indicator" of a
Unicode string, is generally considered optional, but Unicode-savvy
data readers are required to recognize the marker and ignore it when
reading.
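
As a rough sketch of a reader that tolerates the marker either way,
Python's standard "utf-8-sig" codec strips a leading BOM if one is
present and otherwise decodes plain UTF-8. The function name
decode_utf8_field is hypothetical, and stripping trailing zero bytes
assumes the value came from a NUL-terminated TIFF ASCII field.

```python
import codecs


def decode_utf8_field(raw: bytes) -> str:
    """Decode UTF-8 bytes, ignoring an optional leading BOM.

    Hypothetical helper: trailing zero bytes are dropped first, on the
    assumption that the value was stored NUL-terminated.
    """
    raw = raw.rstrip(b"\x00")
    # "utf-8-sig" removes a leading EF BB BF if present, else acts as UTF-8.
    return raw.decode("utf-8-sig")


payload = "東京".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload + b"\x00"
without_bom = payload + b"\x00"

# Both forms decode to the same text, so the writer's choice about the
# BOM does not affect what this reader returns.
assert decode_utf8_field(with_bom) == "東京"
assert decode_utf8_field(without_bom) == "東京"
```
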

--Niles Ritter (author, GeoTIFF 1.0 standard)


_______________________________________________
Geotiff mailing list
Geotifflists.maptools.org
http://lists.maptools.org/mailman/listinfo/geotiff

RE: Unicode String in a GeoTiff file
2007-10-18 10:24:58
Niles,

Thank you for the detailed response.  The method you suggest is the
method I used to resolve this issue; that is, I used UTF-8 encoding to
support both standard ASCII and Japanese Kanji.

Best Regards,
Ken


