Thread: rtl::OUString::iterateCodePoints
|
|
| rtl::OUString::iterateCodePoints |

|
2007-05-09 07:10:35 |
Hi all,

<http://www.openoffice.org/issues/show_bug.cgi?id=76869> requests
functionality to work on an rtl::OUString as a sequence of Unicode
scalar values or code points, rather than a sequence of UTF-16 code
units.

What I came up with is the minimalistic rtl_uString_iterateCodePoints in
rtl/ustring.h (see below) and an accompanying public rtl::OUString
member function

    inline sal_uInt32 iterateCodePoints(
        sal_Int32 * indexUtf16, sal_Int32 postIncrementCodePoints = 1);

that is an almost trivial wrapper around it.

Any comments? Especially, I am interested in the following two points:

1 Would there be legitimate use cases for rtl_uString_iterateCodePoints
to adjust an incoming index that points into the middle of a surrogate
pair, or would that only hide broken code?

2 With the current setup where moving past the beginning or end of the
string is undefined behavior, is there any use for
postIncrementCodePoints outside [-1 .. 1]? Or would there be legitimate
use cases for rtl_uString_iterateCodePoints to stop moving past the
beginning/end of the string when postIncrementCodePoints is too large?

-Stephan

/** Iterate through a string based on code points instead of UTF-16 code
    units.

    See Chapter 3 of The Unicode Standard 5.0 (Addison--Wesley, 2006) for
    definitions of the various terms used in this description.

    The given string is interpreted as a sequence of zero or more UTF-16
    code units.  For each index into this sequence (from zero to the
    length of the sequence, inclusive), a code point represented starting
    at the given index is computed as follows:

    - If the index points to the end of the sequence, the computed code
      point is the special marker SAL_MAX_UINT32.

    - Otherwise, if the UTF-16 code unit addressed by the index
      constitutes a well-formed UTF-16 code unit sequence, the computed
      code point is the scalar value encoded by that UTF-16 code unit
      sequence.

    - Otherwise, if the index is at least two UTF-16 code units away from
      the end of the sequence, and the sequence of two UTF-16 code units
      addressed by the index constitutes a well-formed UTF-16 code unit
      sequence, the computed code point is the scalar value encoded by
      that UTF-16 code unit sequence.

    - Otherwise, the computed code point is the UTF-16 code unit addressed
      by the index.  (This last case catches unmatched surrogates as well
      as indices pointing into the middle of surrogate pairs.)

    @param string
    pointer to a valid string; must not be null.

    @param indexUtf16
    pointer to a UTF-16 based index into the given string; must not be
    null.  On entry, the index must be in the range from zero to the
    length of the string (in UTF-16 code units), inclusive.  Upon
    successful return, the index will be updated to address the UTF-16
    code unit that is the given postIncrementCodePoints away from the
    initial index.

    @param postIncrementCodePoints
    the number of code points to move the given indexUtf16; can be
    negative.  The value must be such that the resulting UTF-16 based
    index is in the range from zero to the length of the string (in
    UTF-16 code units), inclusive.

    @return
    the code point (an integer in the range from 0 to 0x10FFFF,
    inclusive) or the special marker SAL_MAX_UINT32 that is represented
    at the given indexUtf16 starting index within the given string.

    @since UDK 3.2.7
 */
sal_uInt32 SAL_CALL rtl_uString_iterateCodePoints(
    rtl_uString const * string, sal_Int32 * indexUtf16,
    sal_Int32 postIncrementCodePoints);
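
The four cases in the description above can be sketched in portable C++.
This is only an illustration under stated assumptions, not the actual
implementation: std::u16string stands in for rtl_uString, the fixed-width
<cstdint> types stand in for sal_Int32/sal_uInt32, kEndMarker stands in
for SAL_MAX_UINT32, and the helper names are made up. Where the
description leaves behavior undefined, the sketch simply asserts.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Assumed stand-in for SAL_MAX_UINT32 (largest 32-bit unsigned value).
constexpr std::uint32_t kEndMarker = std::numeric_limits<std::uint32_t>::max();

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: the code point represented at UTF-16 index i.
std::uint32_t codePointAt(std::u16string const & s, std::int32_t i) {
    std::int32_t const n = static_cast<std::int32_t>(s.size());
    assert(i >= 0 && i <= n);
    if (i == n) {
        return kEndMarker;  // case 1: index points to the end
    }
    char16_t const c = s[i];
    if (isHighSurrogate(c) && n - i >= 2 && isLowSurrogate(s[i + 1])) {
        // case 3: well-formed surrogate pair
        return 0x10000 + ((std::uint32_t(c) - 0xD800) << 10)
            + (std::uint32_t(s[i + 1]) - 0xDC00);
    }
    // case 2 (single-unit scalar value) and case 4 (unmatched surrogate
    // or index in the middle of a pair): the code unit itself
    return c;
}

// Hypothetical stand-in for the proposed function: returns the code point
// at *indexUtf16, then moves the index by the given number of code points.
std::uint32_t iterateCodePoints(
    std::u16string const & s, std::int32_t * indexUtf16,
    std::int32_t postIncrementCodePoints = 1)
{
    assert(indexUtf16 != nullptr);
    std::int32_t const n = static_cast<std::int32_t>(s.size());
    std::int32_t i = *indexUtf16;
    std::uint32_t const cp = codePointAt(s, i);
    for (; postIncrementCodePoints > 0; --postIncrementCodePoints) {
        assert(i < n);  // moving past the end is undefined in the proposal
        i += (isHighSurrogate(s[i]) && n - i >= 2 && isLowSurrogate(s[i + 1]))
            ? 2 : 1;
    }
    for (; postIncrementCodePoints < 0; ++postIncrementCodePoints) {
        assert(i > 0);  // moving past the beginning is undefined as well
        --i;
        if (i > 0 && isLowSurrogate(s[i]) && isHighSurrogate(s[i - 1])) {
            --i;  // step over the high half of a surrogate pair
        }
    }
    *indexUtf16 = i;
    return cp;
}
```

With a string holding 'a', the surrogate pair D800 DC00, and 'b', three
calls starting from index 0 would return the code points of 'a', U+10000
and 'b' and leave the index at 1, 3 and 4 in turn.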
------------------------------------------------------------
---------
To unsubscribe, e-mail: interface-discuss-unsubscribe openoffice.org
For additional commands, e-mail: interface-discuss-help openoffice.org
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-05-11 14:14:25 |
Hi Stephan,

On Wednesday, 2007-05-09 14:10:35 +0200, Stephan Bergmann wrote:

> 1 Would there be legitimate use cases for rtl_uString_iterateCodePoints
> to adjust an incoming index that points into the middle of a surrogate
> pair, or would that only hide broken code?

I think that in the current state it would more hide broken code than
being useful. Instead, other functions like those mentioned in i76869
could be introduced, if synchronization is needed. On the other hand,
especially finding the start of a code point may be useful when
iterating backwards from the end of the string and a surrogate is the
last two code units. Maybe that's a special case?

> 2 With the current setup where moving past the beginning or end of the
> string is undefined behavior, is there any use for
> postIncrementCodePoints outside [-1 .. 1]?

There may be in scenarios like "next I'll be interested in the character
after the next", so postIncrementCodePoints would be 2.

> Or would there be legitimate
> use cases for rtl_uString_iterateCodePoints to stop moving past the
> beginning/end of the string when postIncrementCodePoints is too large?

I think it should stop if it is called with indexUtf16 being "outside"
the string, or resulting in such a value, so -1 and length would be the
min/max resulting values. Also,

| param postIncrementCodePoints
| the number of code points to move the given indexUtf16; can be negative.
| The value must be such that the resulting UTF-16 based index is in the
| range from zero to the length of this string (in UTF-16 code units),
| inclusive.

leaves the impression that in

    sal_Int32 nIndex = str.getLength() - 1;
    str.iterateCodePoints( &nIndex, 2 )

the value of postIncrementCodePoints would be invalid because it would
increment nIndex beyond the length. Instead, the function should limit
nIndex to str.getLength() upon return.

Eike
--
OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
Please don't send personal mail to this erl sun.com account, which I use for
mailing lists only and don't read from outside Sun. Thanks.
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-05-30 09:26:34 |
Eike Rathke wrote:
> Hi Stephan,
>
> On Wednesday, 2007-05-09 14:10:35 +0200, Stephan Bergmann wrote:
>
>> 1 Would there be legitimate use cases for rtl_uString_iterateCodePoints
>> to adjust an incoming index that points into the middle of a surrogate
>> pair, or would that only hide broken code?
>
> I think that in the current state it would more hide broken code than
> being useful. Instead, other functions like those mentioned in i76869
> could be introduced, if synchronization is needed. On the other hand,
> especially finding the start of a code point may be useful when
> iterating backwards from the end of the string and a surrogate is the
> last two code units. Maybe that's a special case?

    sal_Int32 i = s.getLength();
    s.iterateCodePoints(&i, -1);

will make i point to the start of the last character (if s is nonempty).

>> 2 With the current setup where moving past the beginning or end of the
>> string is undefined behavior, is there any use for
>> postIncrementCodePoints outside [-1 .. 1]?
>
> There may be in scenarios like "next I'll be interested in the character
> after the next", so postIncrementCodePoints would be 2.

My point was that you can only safely make that call if you know that
there are at least two more code points after the current index, which
in general you can only know if you inspect the "surrogate structure" of
the OUString at the sal_Unicode level (which iterateCodePoints should
shield you from). (Whether you can safely make a call with
postIncrementCodePoints in [-1 .. 1] is easily checkable by the caller,
on the other hand.)

>> Or would there be legitimate
>> use cases for rtl_uString_iterateCodePoints to stop moving past the
>> beginning/end of the string when postIncrementCodePoints is too large?
>
> I think it should stop if it is called with indexUtf16 being "outside"
> the string, or resulting in such a value, so -1 and length would be the
> min/max resulting values. Also,

Why -1 instead of 0?

> | param postIncrementCodePoints
> | the number of code points to move the given indexUtf16; can be negative.
> | The value must be such that the resulting UTF-16 based index is in the
> | range from zero to the length of this string (in UTF-16 code units),
> | inclusive.
>
> leaves the impression that in
>
> sal_Int32 nIndex = str.getLength() - 1;
> str.iterateCodePoints( &nIndex, 2 )
>
> the value of postIncrementCodePoints would be invalid because it would
> increment nIndex beyond the length. Instead, the function should limit
> nIndex to str.getLength() upon return.

The nice thing about having it undefined behavior for now is that if
there ever turns up demand to clip excessive moves at 0 or length,
respectively, then that can easily be implemented as a backwards
compatible change.
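
To make the clipping idea concrete, here is a minimal sketch of how the
index movement could be clipped at 0 and at the string length. It is not
part of the proposal: moveClipped is a hypothetical helper name, and
std::u16string with std::int32_t stands in for rtl::OUString and
sal_Int32. Any move that is valid under the current contract ends at the
same index; only moves that are undefined today would gain a defined
result, which is why such a change would stay backwards compatible.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

inline bool isHigh(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLow(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: move UTF-16 index i by `count` code points, stopping
// at 0 or s.size() instead of running past either end of the string.
std::int32_t moveClipped(
    std::u16string const & s, std::int32_t i, std::int32_t count)
{
    std::int32_t const n = static_cast<std::int32_t>(s.size());
    assert(i >= 0 && i <= n);
    for (; count > 0 && i < n; --count) {
        // a well-formed surrogate pair counts as one code point
        i += (isHigh(s[i]) && n - i >= 2 && isLow(s[i + 1])) ? 2 : 1;
    }
    for (; count < 0 && i > 0; ++count) {
        --i;
        if (i > 0 && isLow(s[i]) && isHigh(s[i - 1])) {
            --i;  // step over the high half of a surrogate pair
        }
    }
    return i;
}
```

For the example discussed above (index at length - 1, increment 2), this
sketch would stop at the string length rather than move beyond it.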
-Stephan
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-05-31 09:06:26 |
Hi Stephan,

On Wednesday, 2007-05-30 16:26:34 +0200, Stephan Bergmann wrote:

> >especially finding the start of a code point may be useful when
> >iterating backwards from the end of the string and a surrogate is the
> >last two code units. Maybe that's a special case?
>
> sal_Int32 i = s.getLength();
> s.iterateCodePoints(&i, -1);
>
> will make i point to the start of the last character (if s is nonempty).

Ah, nice, a detail that wasn't clear to me.

> >>2 With the current setup where moving past the beginning or end of the
> >>string is undefined behavior, is there any use for
> >>postIncrementCodePoints outside [-1 .. 1]?
> >
> >There may be in scenarios like "next I'll be interested in the character
> >after the next", so postIncrementCodePoints would be 2.
>
> My point was that you can only safely make that call if you know that
> there are at least two more code points after the current index, which
> in general you can only know if you inspect the "surrogate structure" of
> the OUString at the sal_Unicode level (which iterateCodePoints should
> shield you from).

True. So, then I assume we don't need other postincrement values.

> >>Or would there be legitimate
> >>use cases for rtl_uString_iterateCodePoints to stop moving past the
> >>beginning/end of the string when postIncrementCodePoints is too large?
> >
> >I think it should stop if it is called with indexUtf16 being "outside"
> >the string, or resulting in such a value, so -1 and length would be the
> >min/max resulting values. Also,
>
> Why -1 instead of 0?

I thought of -1 signalling an end condition in reverse iteration, as
does 'length' in forward iteration; both point "outside" the string and
would follow the general [...[ inclusive/exclusive approach.

A forward loop would look like

    for(i=0; i<s.getLength(); )
    {
        c = s.iterateCodePoints( &i, +1);
    }

A similar reverse loop

    for(i=s.getLength(), s.iterateCodePoints( &i, -1); i>=0; )
    {
        c = s.iterateCodePoints( &i, -1);
    }

would not work if 0 was the smallest indexUtf16 value returned in i; one
would have to insert an if(i==0)break; condition at the end of the loop,
quite ugly.. Furthermore the length had to be checked in advance as well
to not enter the loop with an empty string. Altogether nasty, I'd say.
Eike
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-05-31 10:49:24 |
Eike Rathke wrote:
>>>> 2 With the current setup where moving past the beginning or end of the
>>>> string is undefined behavior, is there any use for
>>>> postIncrementCodePoints outside [-1 .. 1]?
>>> There may be in scenarios like "next I'll be interested in the character
>>> after the next", so postIncrementCodePoints would be 2.
>> My point was that you can only safely make that call if you know that
>> there are at least two more code points after the current index, which
>> in general you can only know if you inspect the "surrogate structure" of
>> the OUString at the sal_Unicode level (which iterateCodePoints should
>> shield you from).
>
> True. So, then I assume we don't need other postincrement values.

Think so too. Anyway, having the more general case available (even if
probably not of much use) does not really hurt, so I will leave that in.

>>>> Or would there be legitimate
>>>> use cases for rtl_uString_iterateCodePoints to stop moving past the
>>>> beginning/end of the string when postIncrementCodePoints is too large?
>>> I think it should stop if it is called with indexUtf16 being "outside"
>>> the string, or resulting in such a value, so -1 and length would be the
>>> min/max resulting values. Also,
>> Why -1 instead of 0?
>
> I thought of -1 signalling an end condition in reverse iteration, as
> does 'length' in forward iteration, both point "outside" the string and
> would follow the general [...[ inclusive/exclusive approach.

But what should

    sal_Int32 i = -1;
    s.iterateCodePoints(&i, 1);

mean then? Pseudo-iterate forward to i == 0?

But you are right, reverse-iterating code does look more awkward. Would
it help if postIncrementCodePoints actually acted as
preIncrementCodePoints if it is negative? Is not that what we want? Or
is it too confusing?

    sal_Int32 i = s.getLength();
    while (i != 0) {
        sal_uInt32 c = s.iterateCodePoints(&i, -1);
    }

would then neatly reverse-iterate through any string, and we would get
rid of the ugly SAL_MAX_UINT32 special-case return value.
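
A rough sketch of that variant, to check that the loop behaves as
intended: for a non-negative increment the code point is read first and
the index moved afterwards, while for a negative increment the index is
moved first and the code point at the new position is returned. The
names iterate, unitsAt and reverseCodePoints are hypothetical, and
std::u16string stands in for rtl::OUString. Out-of-range moves simply
assert here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

inline bool isHigh(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLow(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Number of UTF-16 code units (1 or 2) of the code point starting at i,
// where i must be less than s.size().
inline std::int32_t unitsAt(std::u16string const & s, std::int32_t i) {
    std::int32_t const n = static_cast<std::int32_t>(s.size());
    return (isHigh(s[i]) && i + 1 < n && isLow(s[i + 1])) ? 2 : 1;
}

// Hypothetical stand-in: post-increment for increment >= 0, pre-decrement
// for increment < 0.  No end-of-string marker is needed, because a code
// point is always read at an index inside the string.
std::uint32_t iterate(
    std::u16string const & s, std::int32_t * index, std::int32_t increment)
{
    std::int32_t const n = static_cast<std::int32_t>(s.size());
    std::int32_t i = *index;
    assert(i >= 0 && i <= n);
    for (; increment < 0; ++increment) {  // move back before reading
        assert(i > 0);
        --i;
        if (i > 0 && isLow(s[i]) && isHigh(s[i - 1])) {
            --i;
        }
    }
    assert(i < n);  // there must be a code point to return
    std::uint32_t const cp = unitsAt(s, i) == 2
        ? 0x10000 + ((std::uint32_t(s[i]) - 0xD800) << 10)
            + (std::uint32_t(s[i + 1]) - 0xDC00)
        : std::uint32_t(s[i]);
    for (; increment > 0; --increment) {  // move forward after reading
        assert(i < n);
        i += unitsAt(s, i);
    }
    *index = i;
    return cp;
}

// The reverse loop from above, collecting the code points.
std::vector<std::uint32_t> reverseCodePoints(std::u16string const & s) {
    std::vector<std::uint32_t> out;
    std::int32_t i = static_cast<std::int32_t>(s.size());
    while (i != 0) {
        out.push_back(iterate(s, &i, -1));
    }
    return out;
}
```

The loop condition i != 0 covers the empty string as well, since the
body is then never entered.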
-Stephan

> A forward loop would look like
>
> for(i=0; i<s.getLength(); )
> {
> c = s.iterateCodePoints( &i, +1);
> }
>
> A similar reverse loop
>
> for(i=s.getLength(), s.iterateCodePoints( &i, -1); i>=0; )
> {
> c = s.iterateCodePoints( &i, -1);
> }
>
> would not work if 0 was the smallest indexUtf16 value returned in i, one
> would have to insert an if(i==0)break; condition at the end of the loop,
> quite ugly.. Furthermore the length had to be checked in advance as well
> to not enter the loop with an empty string. Altogether nasty, I'd say.
>
> Eike
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-05-31 13:41:31 |
Hi Stephan,

On Thursday, 2007-05-31 17:49:24 +0200, Stephan Bergmann wrote:

> >>Why -1 instead of 0?
> >
> >I thought of -1 signalling an end condition in reverse iteration, as
> >does 'length' in forward iteration, both point "outside" the string and
> >would follow the general [...[ inclusive/exclusive approach.
>
> But what should
>
> sal_Int32 i = -1;
> s.iterateCodePoints(&i, 1);
>
> mean then? Pseudo-iterate forward to i == 0?

Yes, analogous to reverse-iterating with i=s.getLength(). However, that
case may be a bit pathological.. it also needs to return SAL_MAX_UINT32
again. So, if we preincremented on reverse-iteration like mentioned
below, what would this situation give? The same?

> But you are right, reverse-iterating code does look more awkward. Would
> it help if postIncrementCodePoints actually acted as
> preIncrementCodePoints if it is negative? Is not that what we want?

It is.

> Or is it too confusing?

I don't think so. Well, maybe at the beginning, but it does what we
want; it just reverses the entire behavior.

> sal_Int32 i = s.getLength();
> while (i != 0) {
> sal_uInt32 c = s.iterateCodePoints(&i, -1);
> }
>
> would then neatly reverse-iterate through any string, and we would get
> rid of the ugly SAL_MAX_UINT32 special-case return value.

What if 'i' is 0, and maybe 's' also an empty string? This wouldn't
happen in a proper loop, but a call to iterateCodePoints() in these
cases would result in what?

    s.iterateCodePoints(&i, +1) => i== 0 ? because getLength()==0
    s.iterateCodePoints(&i,  0) => i== 0 ? because not iterating
    s.iterateCodePoints(&i, -1) => i==-1 ? because preincremented past the beginning

And the return value?

Eike
|
|
| Re: rtl::OUString::iterateCodePoints |

|
2007-06-01 04:19:56 |
Eike Rathke wrote:
>> But you are right, reverse-iterating code does look more awkward. Would
>> it help if postIncrementCodePoints actually acted as
>> preIncrementCodePoints if it is negative? Is not that what we want?
>
> It is.
>
>> Or is it too confusing?
>
> I don't think so. Well, maybe at the beginning, but it does what we
> want; it just reverses the entire behavior.
>
>> sal_Int32 i = s.getLength();
>> while (i != 0) {
>> sal_uInt32 c = s.iterateCodePoints(&i, -1);
>> }
>>
>> would then neatly reverse-iterate through any string, and we would get
>> rid of the ugly SAL_MAX_UINT32 special-case return value.
>
> What if 'i' is 0, and maybe 's' also an empty string? This wouldn't
> happen in a proper loop, but a call to iterateCodePoints() in these
> cases would result in what?
>
> s.iterateCodePoints(&i, +1) => i== 0 ? because getLength()==0
> s.iterateCodePoints(&i,  0) => i== 0 ? because not iterating
> s.iterateCodePoints(&i, -1) => i==-1 ? because preincremented past the beginning
>
> And the return value?

All three cases would be undefined behavior. I would change the
preconditions for

    iterateCodePoints(
        sal_Int32 * indexUtf16, sal_Int32 incrementCodePoints)

as follows:

- indexUtf16 must not be null
- if incrementCodePoints >= 0:
  - *indexUtf16 must be in [0 .. length[
  - *indexUtf16 + incrementCodePoints must be in [0 .. length]
- if incrementCodePoints < 0:
  - *indexUtf16 must be in [0 .. length]
  - *indexUtf16 + incrementCodePoints must be in [0 .. length[
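
The list above can be written down as a small predicate, for example to
use in debug assertions. This is a literal transcription under
assumptions: preconditionsHold is a hypothetical name, std::int32_t
stands in for sal_Int32, and the sum is taken as plain arithmetic. Since
the increment counts code points while the index counts UTF-16 code
units, passing this check is necessary but not sufficient when surrogate
pairs lie between the old and the new position.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: checks the listed preconditions for a string of
// the given length (in UTF-16 code units).
bool preconditionsHold(
    std::int32_t length, std::int32_t const * indexUtf16,
    std::int32_t incrementCodePoints)
{
    assert(length >= 0);
    if (indexUtf16 == nullptr) {
        return false;
    }
    // 64-bit arithmetic so that the sum cannot overflow
    std::int64_t const i = *indexUtf16;
    std::int64_t const target = i + incrementCodePoints;
    if (incrementCodePoints >= 0) {
        // index in [0 .. length[, target in [0 .. length]
        return i >= 0 && i < length && target >= 0 && target <= length;
    }
    // index in [0 .. length], target in [0 .. length[
    return i >= 0 && i <= length && target >= 0 && target < length;
}
```

For an empty string and an index of 0, the predicate is false for
increments of +1, 0 and -1, which matches the statement that all three
cases are undefined behavior.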
-Stephan