Thread: printing wchar_t*

printing wchar_t* | 2006-04-14 08:30:59
> From: Vladimir Prus <ghost cs.msu.su>
> Date: Fri, 14 Apr 2006 10:01:57 +0400
>
> > What character set is used by the wide characters in the wchar_t
> > arrays? GDB has some support for a few single-byte character sets,
> > see the node "Character Sets" in the manual.
>
> Relatively safe bet would be to assume it's some zero-terminated character
> set. I plan to assume it's either UTF-16 or UTF-32 in the GUI (the
> conversion code is the same for both encodings), but gdb can just print raw
> values.

We should get our terminology right: UTF-16 is not a character set,
it's an encoding (and a multibyte encoding, btw). As for UTF-32, I
don't think such a beast exists at all.

I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
32-bit Unicode characters, respectively.

> > It's one possibility, the other one being to call a function in the
> > debuggee to produce the string.
>
> And what will such a function return? char* in local 8-bit encoding? In that
> case, not all wchar_t* variables can be printed.

If you want to display non-ASCII strings, it means you already have
some way of displaying such characters. The function I mentioned
would not return anything, it would actually _display_ the string.

For example, in command-line version of GDB, if the terminal supports
UTF-8 encoded characters, that function would output a UTF-8 encoding
of the non-ASCII string, and then the terminal will display them with
the correct glyphs.

> > Yet another possibility is to do the
> > conversion in your GUI front end.
>
> That's what I'm going to do, but first I need to get raw data, preferably
> without issuing an MI command for every single character.

A wchar_t string is just an array, and GDB already has a feature to
produce N elements of an array. In CLI, you say "print *array@20" to
print the first 20 elements of the named array.

printing wchar_t* | 2006-04-14 08:46:57
On Friday 14 April 2006 12:30, Eli Zaretskii wrote:
> > Relatively safe bet would be to assume it's some zero-terminated
> > character set. I plan to assume it's either UTF-16 or UTF-32 in the GUI
> > (the conversion code is the same for both encodings), but gdb can just
> > print raw values.
>
> We should get our terminology right: UTF-16 is not a character set,
> it's an encoding (and a multibyte encoding, btw). As for UTF-32, I
> don't think such a beast exists at all.
>
> I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
> 32-bit Unicode characters, respectively.

No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
encoding (which does exist, in the Unicode standard).
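
To illustrate the claim that one conversion routine can serve both encodings, here is a minimal Python sketch. It is not code from GDB or KDevelop: the function name decode_code_units and the choice to substitute U+FFFD for invalid values are assumptions made for this example. It takes a list of integer code units, combines UTF-16 surrogate pairs where they occur, and otherwise treats each unit as a code point, which is all UTF-32 input needs.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, used here for anything that cannot be decoded


def decode_code_units(units):
    """Decode a list of wchar_t values that hold UTF-16 or UTF-32 text.

    Hypothetical helper: surrogate pairs are combined, every other unit
    is taken as a code point as-is.
    """
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # UTF-16 surrogate pair: rebuild the code point above U+FFFF.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            code_point = unit
            i += 1
        # Lone surrogates and values beyond U+10FFFF are not valid characters.
        if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
            code_point = REPLACEMENT
        chars.append(chr(code_point))
    return "".join(chars)


# The same character, first as a UTF-16 surrogate pair, then as one UTF-32 unit.
print(decode_code_units([0x48, 0xD83D, 0xDE00]) == decode_code_units([0x48, 0x1F600]))
```

One caveat of sharing the code: surrogate values are not valid in UTF-32, so a strict UTF-32 decoder would reject a pair that this sketch silently combines.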
> > > It's one possibility, the other one being to call a function in the
> > > debuggee to produce the string.
> >
> > And what will such a function return? char* in local 8-bit encoding? In
> > that case, not all wchar_t* variables can be printed.
>
> If you want to display non-ASCII strings, it means you already have
> some way of displaying such characters. The function I mentioned
> would not return anything, it would actually _display_ the string.
>
> For example, in command-line version of GDB, if the terminal supports
> UTF-8 encoded characters, that function would output a UTF-8 encoding
> of the non-ASCII string, and then the terminal will display them with
> the correct glyphs.

This is a non-starter. I can't have the debuggee send data to KDevelop widgets.

> > > Yet another possibility is to do the
> > > conversion in your GUI front end.
> >
> > That's what I'm going to do, but first I need to get raw data,
> > preferably without issuing an MI command for every single character.
>
> A wchar_t string is just an array, and GDB already has a feature to
> produce N elements of an array. In CLI, you say "print *array@20" to
> print the first 20 elements of the named array.

I don't know how many elements there are, as wchar_t* is zero terminated, so
I'd like gdb to compute the length automatically.
- Volodya
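
What "compute the length automatically" amounts to can be sketched in Python: walk fixed-size code units until a zero unit is found. This is not a GDB API. The name read_wide_units, the unit_size and byteorder parameters, and the idea that the caller already holds a snapshot of raw target memory as bytes are all assumptions made for illustration.

```python
def read_wide_units(memory, unit_size, byteorder="little"):
    """Return the code units of a zero-terminated wide string.

    Hypothetical helper: `memory` is a bytes snapshot starting at the
    string's address, `unit_size` is sizeof(wchar_t) on the target
    (commonly 2 or 4), and `byteorder` is the target's endianness.
    """
    units = []
    # Step through whole units only; a trailing partial unit is ignored.
    for offset in range(0, len(memory) - unit_size + 1, unit_size):
        unit = int.from_bytes(memory[offset:offset + unit_size], byteorder)
        if unit == 0:
            break  # terminator reached, so the length is now known
        units.append(unit)
    return units


# A 4-byte wchar_t string "hi", its terminator, then unrelated bytes.
snapshot = "hi".encode("utf-32-le") + b"\x00\x00\x00\x00" + b"junk"
print(len(read_wide_units(snapshot, 4)))  # prints 2
```

If the snapshot ends before a terminator is seen, the sketch simply returns what it has; a real implementation would have to fetch more memory or impose a limit.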

printing wchar_t* | 2006-04-14 12:55:49
> From: Vladimir Prus <ghost cs.msu.su>
> Date: Fri, 14 Apr 2006 12:46:57 +0400
> Cc: gdb sources.redhat.com
>
> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
> encoding (which does exist, in the Unicode standard).

What software uses that?

Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

> > For example, in command-line version of GDB, if the terminal supports
> > UTF-8 encoded characters, that function would output a UTF-8 encoding
> > of the non-ASCII string, and then the terminal will display them with
> > the correct glyphs.
>
> This is a non-starter. I can't have the debuggee send data to KDevelop widgets.

That was just an example. I know it's irrelevant to your case (and,
in fact, to any GUI front-end).

> > A wchar_t string is just an array, and GDB already has a feature to
> > produce N elements of an array. In CLI, you say "print *array@20" to
> > print the first 20 elements of the named array.
>
> I don't know how many elements there are, as wchar_t* is zero terminated, so
> I'd like gdb to compute the length automatically.

That's easy. Assuming that is done, is it all you need?

printing wchar_t* | 2006-04-14 13:00:29
On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
> > From: Vladimir Prus <ghost cs.msu.su>
> > Date: Fri, 14 Apr 2006 12:46:57 +0400
> > Cc: gdb sources.redhat.com
> >
> > No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
> > encoding (which does exist, in the Unicode standard).
>
> What software uses that?

I'd say, any software using std::wstring on Linux.

> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

Since the C++ standard says nothing about the encoding of wchar_t, a specific
application can do anything it likes. In particular, I believe that on
Windows, wchar_t* is assumed to be in UTF-16 encoding.

> > > A wchar_t string is just an array, and GDB already has a feature to
> > > produce N elements of an array. In CLI, you say "print *array@20" to
> > > print the first 20 elements of the named array.
> >
> > I don't know how many elements there are, as wchar_t* is zero terminated,
> > so I'd like gdb to compute the length automatically.
>
> That's easy. Assuming that is done, is it all you need?

Yes, that would be sufficient for me.
- Volodya

printing wchar_t* | 2006-04-14 13:06:33
Vladimir Prus wrote:
> On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
>>> From: Vladimir Prus <ghost cs.msu.su>
>>> Date: Fri, 14 Apr 2006 12:46:57 +0400
>>> Cc: gdb sources.redhat.com
>>>
>>> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
>>> encoding (which does exist, in the Unicode standard).
>> What software uses that?
>
> I'd say, any software using std::wstring on Linux.
>
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
>
> Since the C++ standard says nothing about the encoding of wchar_t, a specific
> application can do anything it likes. In particular, I believe that on
> Windows, wchar_t* is assumed to be in UTF-16 encoding.

It only makes sense to talk about UTF-16 encoding in the context
of wchar_t if wchar_t is 16 bits; otherwise, as noted above, UTF-16
is a variable length encoding, not suitable for wchar_t.

printing wchar_t* | 2006-04-14 13:07:29
On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

There's a rant about this in the glibc manual I was just reading...

In fact, on many platforms, wchar_t is only 16-bit. How exactly you
handle UTF-8 or UCS-4 input in this case, I don't really understand.

--
Daniel Jacobowitz
CodeSourcery

printing wchar_t* | 2006-04-14 13:38:28
Daniel Jacobowitz wrote:
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
>
> There's a rant about this in the glibc manual I was just reading...
>
> In fact, on many platforms, wchar_t is only 16-bit. How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Seems clear, you can only represent a limited range of codes if you
only have 16 bits!

UTF-8 is a variable length encoding that can represent any character
in the 32-bit range. Obviously if you have to construct wchar_t
values from UTF-8 input, then you will not be able to represent
characters whose codes exceed 65535. Same with UCS-4.
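
The arithmetic behind this can be checked with a small Python sketch. The helper names utf8_length and fits_in_16_bit_wchar are made up for this example and do not come from any library. It shows how many bytes UTF-8 needs for a given code point, and whether that code point fits in a single 16-bit wchar_t value. Note that current Unicode stops at U+10FFFF, so UTF-8 as defined today never uses more than four bytes per character.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))


def fits_in_16_bit_wchar(code_point):
    """True if the code point can be stored in one 16-bit wchar_t value."""
    return code_point <= 0xFFFF  # 0xFFFF == 65535


# A few sample code points: ASCII 'A', e-acute, the euro sign,
# and a character outside the 16-bit range.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp), fits_in_16_bit_wchar(cp))
```

The last sample needs four UTF-8 bytes and does not fit in one 16-bit unit, which is the case the thread is arguing about: such a character is either lost or has to be stored as a UTF-16 surrogate pair.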

printing wchar_t* | 2006-04-14 14:23:43
> Date: Fri, 14 Apr 2006 09:07:29 -0400
> From: Daniel Jacobowitz <drow false.org>
> Cc: Vladimir Prus <ghost cs.msu.su>, gdb sources.redhat.com
>
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> > Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
>
> There's a rant about this in the glibc manual I was just reading...
>
> In fact, on many platforms, wchar_t is only 16-bit. How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Robert answered that, and I agree with his response.