Thread: Unicode word boundary
|
|
| Unicode word boundary |
  United States |
2007-08-10 09:53:49 |
Hello, I'm using .NET, and my goal is to find a word that consists ONLY of 2 or more "extended ASCII" (128-255) characters.
For instance: "àãó" is OK; "abàã" is not, because it contains 2 "regular" ASCII characters at the beginning.
The core of the regex is "[\u0080-\u00FF]{2,}", which works perfectly.
The problem starts when I want to wrap it in \b's (I need a whole word only). Some Unicode characters are considered stop marks and act as "word cutters". For instance:
Using "\b[\u0080-\u00FF]{2,}\b", "Ö°A" is considered a match, because the middle character is treated as a word stopper, so the last "A" is ignored. This shouldn't have been a match, because I'm looking for words containing only "irregular" characters, and "Ö°A" contains an "A".
I know I should be using \p but can't find the right use.
Any help will be more than welcome.
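The behaviour described above can be reproduced outside .NET. What follows is a minimal sketch using Python's standard `re` module (an assumption made for the sake of a runnable example; the question itself is about .NET, whose \b is likewise Unicode-aware by default). The helper name `bounded_matches` is hypothetical. It shows that the degree sign (U+00B0) in "Ö°A" is not a word character, so \b finds a boundary between it and the trailing "A".

```python
import re

# Core character class from the post: two or more characters in U+0080..U+00FF.
CORE = r"[\u0080-\u00FF]{2,}"

# The same class wrapped in \b word-boundary assertions.
BOUNDED = re.compile(r"\b" + CORE + r"\b")


def bounded_matches(text):
    """Return every substring matched by the boundary-wrapped pattern."""
    return [m.group(0) for m in BOUNDED.finditer(text)]


print(bounded_matches("\u00e0\u00e3\u00f3"))  # "àãó": the whole word matches
print(bounded_matches("ab\u00e0\u00e3"))      # "abàã": no match, as intended
print(bounded_matches("\u00d6\u00b0A"))       # "Ö°A": "Ö°" still matches, which is the problem
```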
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the
Google Groups "Regex" group.
To post to this group, send email to regex googlegroups.com
To unsubscribe from this group, send email to
regex-unsubscribe googlegroups.com
For more options, visit this group at http://groups.google.com/group/regex
-~----------~----~----~----~------~----~------~--~---
|
|
| Re: Unicode word boundary |
  United States |
2007-08-13 09:04:36 |
> Using "\b[\u0080-\u00FF]{2,}\b", "Ö°A" is considered a match, because
> the middle character is treated as a word stopper, so the last "A" is
> ignored. This shouldn't have been a match, because I'm looking for words
> containing only "irregular" characters, and "Ö°A" contains an "A".
Hi Yonido,
Since the \b is giving you an inexact match, I tried \s+
SOURCE:
For instance: àãó is OK
junk abàã is not
junk Ö°A is considered
PATTERN:
\s+[\u0080-\u00FF]{2,}\s+
RESULTS:
Match
àãó
Does this help?
Thanks!
Syd
|
|
| Re: Unicode word boundary |
  United States |
2007-08-13 10:55:16 |
Hello Syd, and thanks for replying!
The problem with \s+ is that it requires whitespace before and after the word.
If I have the word "àãó" ONLY, without spaces before and after it, your regex will not match (\b would).
I need to imitate \b for Unicode.
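To make the objection concrete, here is a small sketch, again in Python's `re` rather than .NET (an assumption for illustration). The helper name `spaced_match` is hypothetical. It shows that the whitespace-wrapped pattern only matches when there is whitespace on both sides of the word.

```python
import re

# The suggested pattern: require whitespace on both sides of the run.
SPACED = re.compile(r"\s+[\u0080-\u00FF]{2,}\s+")


def spaced_match(text):
    """Return the matched text (surrounding whitespace included), or None."""
    m = SPACED.search(text)
    return m.group(0) if m else None


print(repr(spaced_match("For instance: \u00e0\u00e3\u00f3 is OK")))  # matched, spaces included
print(repr(spaced_match("\u00e0\u00e3\u00f3")))                      # None: nothing before or after
print(repr(spaced_match("blah \u00e0\u00e3\u00f3")))                 # None: nothing after
```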
|
|
| Re: Unicode word boundary |
  United States |
2007-08-14 05:09:10 |
>> I need to imitate \b - for unicode..
Hi Yonido,
I jumped to a conclusion, and a bad one at that.
Well, I didn't really know what Unicode was, and now maybe I do.
I read through the superb page http://www.regular-expressions.info/unicode.html and stitched together something that appears to work. Please do visit this page. I cannot claim to be a Unicode expert after reading that!!
Anyway, here are the results.
SOURCE:
"àãó" is OK
"abàã" is not
"Ö°A" is considered a match
Ö°A
àãó
Ö°Ö°A
Aàãó
'àãó'
Pattern:
[^\p\p'"][\u0080-\u00FF]{2,}[^\p\p'"]
Results:
Match
"àãó"
àãó
Ö°Ö°
'àãó'
It is still picking up the " and the '.
We could eliminate that by making our expression [^\p\p'"][\u0080-\u00FF]{2,}[^\p\p'"], but that's your call...
Please let me know if this is accurate or if you had to tune it to work.
I am sure you will be able to understand the Unicode URL and make this work. All the best!!
Thanks!
Syd
|
|
| Re: Unicode word boundary |
  United States |
2007-08-14 05:25:12 |
Hello Syd, and many thanks for the effort.
I actually read this page about Unicode when I first started writing this regex, and indeed \pL and \pM can be used.
But you still don't get the real problem.
Your pattern requires that there be SOMETHING before the "àãó" characters (the "[^\p\p'"]" part of your pattern), and SOMETHING after them.
What about an input that is simply "àãó" (WITHOUT the quotes)? It should match, but there's nothing before it. The same goes for "blah àãó", which has nothing after it.
The obvious solution would be to add a "?" to the "[^\p\p'"]" to make it optional, but then almost everything will match (for example "aaaàãó").
That is the exact purpose of \b, which I can't figure out how to imitate.
Thanks!
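The point about the optional delimiter can be checked with a short sketch. It uses Python's `re` rather than .NET, and \s stands in for the delimiter class because Python's standard `re` module has no \p support; both are assumptions for illustration, and the helper name `optional_match` is hypothetical.

```python
import re

# \s stands in for the delimiter class. Making it optional with "?"
# means the pattern no longer requires any delimiter at all.
OPTIONAL = re.compile(r"\s?[\u0080-\u00FF]{2,}\s?")


def optional_match(text):
    """Return the first match of the optional-delimiter pattern, or None."""
    m = OPTIONAL.search(text)
    return m.group(0) if m else None


print(optional_match("aaa\u00e0\u00e3\u00f3"))  # the run after "aaa" is still matched
print(optional_match("\u00d6\u00b0A"))          # "Ö°" is matched despite the trailing "A"
```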
|
|
| Re: Unicode word boundary |
  United States |
2007-08-14 06:08:21 |
Well, I finally did it (with some help). The result is:
(^|[^\p])[\x80-\xFF]{2,}($|[^\p])
I did not know that you can write (^|\s), for instance, meaning (start of string OR space), which is the technique used here.
Thanks again for the help!
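The "(start of string OR delimiter)" idea can be sketched as follows. This is an illustration in Python's `re`, not the .NET pattern above: \s is used as the delimiter because Python's standard `re` module has no \p support, and the names `consuming_words` and `lookaround_words` are hypothetical. The second pattern is a variant that was not part of the thread: it asserts the delimiters with lookarounds instead of consuming them.

```python
import re

CORE = r"[\u0080-\u00FF]{2,}"

# Alternation form: (start OR delimiter) ... (end OR delimiter).
# The delimiters are consumed, so the word itself is captured in group 1.
CONSUMING = re.compile(r"(?:^|\s)(" + CORE + r")(?:$|\s)")

# Lookaround form: assert "no non-space character on either side"
# without consuming anything.
LOOKAROUND = re.compile(r"(?<!\S)" + CORE + r"(?!\S)")


def consuming_words(text):
    """Words found by the alternation form (group 1 only)."""
    return [m.group(1) for m in CONSUMING.finditer(text)]


def lookaround_words(text):
    """Words found by the lookaround form."""
    return [m.group(0) for m in LOOKAROUND.finditer(text)]


print(consuming_words("\u00e0\u00e3\u00f3"))        # bare word: matched
print(consuming_words("blah \u00e0\u00e3\u00f3"))   # nothing after it: matched
print(consuming_words("aaa\u00e0\u00e3\u00f3"))     # glued to "aaa": rejected
print(consuming_words("\u00d6\u00b0A"))             # "Ö°A": rejected
print(consuming_words("\u00e0\u00e3\u00f3 \u00e9\u00e8"))   # second word is missed
print(lookaround_words("\u00e0\u00e3\u00f3 \u00e9\u00e8"))  # both words are found
```

The alternation form consumes the separating space, so two adjacent qualifying words cannot both match in a single scan; the lookaround form avoids that. .NET also supports lookbehind and lookahead, so the same idea should carry over, though the delimiter class would need adapting.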
|
|
| Re: Unicode word boundary |
  United States |
2007-08-14 12:38:10 |
Glad to be of help. Have a good one!