Thread: printing wchar_t*




printing wchar_t*
user name
2006-04-14 08:30:59
> From:  Vladimir Prus <ghostcs.msu.su>
> Date:  Fri, 14 Apr 2006 10:01:57 +0400
> 
> > What character set is used by the wide characters in the wchar_t
> > arrays?  GDB has some support for a few single-byte character sets,
> > see the node "Character Sets" in the manual.
> 
> Relatively safe bet would be to assume it's some zero-terminated character
> set. I plan to assume it's either UTF-16 or UTF-32 in the GUI (the
> conversion code is the same for both encodings), but gdb can just print raw
> values.

We should get our terminology right: UTF-16 is not a character set,
it's an encoding (and a multibyte encoding, btw).  As for UTF-32, I
don't think such a beast exists at all.

I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
32-bit Unicode characters, respectively.

> > It's one possibility, the other one being to call a function in the
> > debuggee to produce the string.
> 
> And what will such a function return? char* in local 8-bit encoding? In that
> case, not all wchar_t* variables can be printed.

If you want to display non-ASCII strings, it means you already have
some way of displaying such characters.  The function I mentioned
would not return anything, it would actually _display_ the string.

For example, in command-line version of GDB, if the terminal supports
UTF-8 encoded characters, that function would output a UTF-8 encoding
of the non-ASCII string, and then the terminal will display them with
the correct glyphs.
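
To make the UTF-8 output idea concrete, here is a minimal sketch in C of the
encoding step: turning one Unicode code point into the UTF-8 bytes that a
UTF-8 capable terminal could render.  The name `utf8_encode` is hypothetical
and is not a GDB function; it only illustrates the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GDB function: encode one code point as UTF-8.
   Writes up to 4 bytes to `out` and returns the byte count, or 0 if `cp`
   is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* lone surrogates are invalid */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes: rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A caller would run this over each decoded character and write the resulting
bytes to the terminal, for instance with fwrite().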

> > Yet another possibility is to do the 
> > conversion in your GUI front end.
> 
> That's what I'm going to do, but first I need to get raw data, preferably
> without issuing an MI command for every single character.

A wchar_t string is just an array, and GDB already has a feature to
produce N elements of an array.  In CLI, you say "print *array@20" to
print the first 20 elements of the named array.

printing wchar_t*
user name
2006-04-14 08:46:57
On Friday 14 April 2006 12:30, Eli Zaretskii wrote:

> > Relatively safe bet would be to assume it's some zero-terminated
> > character set. I plan to assume it's either UTF-16 or UTF-32 in the GUI
> > (the conversion code is the same for both encodings), but gdb can just
> > print raw values.
>
> We should get our terminology right: UTF-16 is not a character set,
> it's an encoding (and a multibyte encoding, btw).  As for UTF-32, I
> don't think such a beast exists at all.
>
> I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
> 32-bit Unicode characters, respectively.

No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
encoding (which does exist, in the Unicode standard).
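
As a rough sketch of why one conversion routine can serve both encodings:
if the code units are widened to 32 bits first, a loop that combines
surrogate pairs decodes UTF-16 and passes well-formed UTF-32 through
unchanged, because UTF-32 contains no surrogates.  The name `decode_units`
is hypothetical, and this is not the actual KDevelop code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode `n` code units (already widened to 32 bits)
   into code points stored in `out`, which must have room for `n` entries.
   Returns the number of code points written. */
size_t decode_units(const uint32_t *units, size_t n, uint32_t *out)
{
    assert(units != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t u = units[i];
        /* A high surrogate followed by a low surrogate forms one
           code point above U+FFFF. */
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            i++;  /* skip the low surrogate just consumed */
        }
        out[count++] = u;
    }
    return count;
}
```

Unpaired surrogates are copied through as-is in this sketch; a real front
end would probably want to flag them instead.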

> > > It's one possibility, the other one being to call a function in the
> > > debuggee to produce the string.
> >
> > And what will such a function return? char* in local 8-bit encoding? In
> > that case, not all wchar_t* variables can be printed.
>
> If you want to display non-ASCII strings, it means you already have
> some way of displaying such characters.  The function I mentioned
> would not return anything, it would actually _display_ the string.
>
> For example, in command-line version of GDB, if the terminal supports
> UTF-8 encoded characters, that function would output a UTF-8 encoding
> of the non-ASCII string, and then the terminal will display them with
> the correct glyphs.

This is a non-starter. I can't have the debuggee send data to
KDevelop widgets.

> > > Yet another possibility is to do the
> > > conversion in your GUI front end.
> >
> > That's what I'm going to do, but first I need to get raw data,
> > preferably without issuing an MI command for every single character.
>
> A wchar_t string is just an array, and GDB already has a feature to
> produce N elements of an array.  In CLI, you say "print *array@20" to
> print the first 20 elements of the named array.

I don't know how many elements there are, as wchar_t* is zero terminated, so
I'd like gdb to compute the length automatically.

- Volodya
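
For illustration, computing the length of a zero-terminated wide string
amounts to the loop below, which is what the standard wcslen() does.  This
is a sketch in C with a hypothetical name (`wide_length`); it does not show
how GDB itself would walk debuggee memory.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: count the elements that precede the terminating
   zero of a wide string, in the same way as the standard wcslen(). */
size_t wide_length(const wchar_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != L'\0')
        n++;
    return n;
}
```

With the length known, the front end could then request exactly that many
array elements in a single command.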


printing wchar_t*
user name
2006-04-14 12:55:49
> From: Vladimir Prus <ghostcs.msu.su>
> Date: Fri, 14 Apr 2006 12:46:57 +0400
> Cc: gdbsources.redhat.com
> 
> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
> encoding (which does exist, in the Unicode standard).

What software uses that?

Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

> > For example, in command-line version of GDB, if the terminal supports
> > UTF-8 encoded characters, that function would output a UTF-8 encoding
> > of the non-ASCII string, and then the terminal will display them with
> > the correct glyphs.
> 
> This is a non-starter. I can't have the debuggee send data to KDevelop widgets.

That was just an example.  I know it's irrelevant to your case (and,
in fact, to any GUI front-end).

> > A wchar_t string is just an array, and GDB already has a feature to
> > produce N elements of an array.  In CLI, you say "print *array@20" to
> > print the first 20 elements of the named array.
> 
> I don't know how many elements there are, as wchar_t* is zero terminated, so
> I'd like gdb to compute the length automatically.

That's easy.  Assuming that is done, is it all you need?

printing wchar_t*
user name
2006-04-14 13:00:29
On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
> > From: Vladimir Prus <ghostcs.msu.su>
> > Date: Fri, 14 Apr 2006 12:46:57 +0400
> > Cc: gdbsources.redhat.com
> >
> > No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
> > encoding (which does exist, in the Unicode standard).
>
> What software uses that?

I'd say, any software using std::wstring on Linux.

> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

Since the C++ standard says nothing about the encoding of wchar_t, a specific
application can do anything it likes. In particular, I believe that on
Windows, wchar_t* is assumed to be in UTF-16 encoding.

> > > A wchar_t string is just an array, and GDB already has a feature to
> > > produce N elements of an array.  In CLI, you say "print *array@20" to
> > > print the first 20 elements of the named array.
> >
> > I don't know how many elements there are, as wchar_t* is zero terminated,
> > so I'd like gdb to compute the length automatically.
>
> That's easy.  Assuming that is done, is it all you need?

Yes, that would be sufficient for me.

- Volodya

printing wchar_t*
user name
2006-04-14 13:06:33
Vladimir Prus wrote:
> On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
>>> From: Vladimir Prus <ghostcs.msu.su>
>>> Date: Fri, 14 Apr 2006 12:46:57 +0400
>>> Cc: gdbsources.redhat.com
>>>
>>> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
>>> encoding (which does exist, in the Unicode standard).
>> What software uses that?
> 
> I'd say, any software using std::wstring on Linux.
> 
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> Since the C++ standard says nothing about the encoding of wchar_t, a specific
> application can do anything it likes. In particular, I believe that on
> Windows, wchar_t* is assumed to be in UTF-16 encoding.

It only makes sense to talk about UTF-16 encoding in the context
of wchar_t if wchar_t is 16 bits, otherwise, as noted above, UTF-16
is a variable-length encoding, not suitable for wchar_t.

printing wchar_t*
user name
2006-04-14 13:07:29
On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

There's a rant about this in the glibc manual I was just reading...

In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
handle UTF-8 or UCS-4 input in this case, I don't really understand.

-- 
Daniel Jacobowitz
CodeSourcery

printing wchar_t*
user name
2006-04-14 13:38:28
Daniel Jacobowitz wrote:
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> There's a rant about this in the glibc manual I was just reading...
> 
> In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Seems clear, you can only represent a limited range of codes if you
only have 16 bits!

UTF-8 is a variable length encoding that can represent any character
in the 32-bit range. Obviously if you have to construct wchar_t
values from UTF-8 input, then you will not be able to represent
characters whose codes exceed 65535. Same with UCS-4.
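
A small sketch in C of that limit: the width of wchar_t can be queried at
compile time, and a code point fits in a single unit only if it is below
2 to the power of that width.  The helper names `wchar_bits` and
`fits_in_one_unit` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width of wchar_t on the current platform, in bits. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if code point `cp` can be stored in a
   single unit that is `bits` bits wide. */
int fits_in_one_unit(uint32_t cp, unsigned bits)
{
    assert(bits > 0);
    if (bits >= 32)
        return 1;  /* every uint32_t value fits; also avoids shifting by 32 */
    return cp < (UINT32_C(1) << bits);
}
```

For example, fits_in_one_unit(0x10000, 16) is false, which is the case
described above: the first code point past 65535 does not fit in a 16-bit
wchar_t.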


printing wchar_t*
user name
2006-04-14 14:23:43
> Date: Fri, 14 Apr 2006 09:07:29 -0400
> From: Daniel Jacobowitz <drowfalse.org>
> Cc: Vladimir Prus <ghostcs.msu.su>, gdbsources.redhat.com
> 
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> > Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> There's a rant about this in the glibc manual I was just reading...
> 
> In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Robert answered to that, and I agree with his response.