Thread: Named-capture regex syntax




Named-capture regex syntax
user name
2006-12-26 01:33:43
SADAHIRO Tomoyuki schreef:

> +Currently NAME is restricted to simple identifiers only.

I think it is important to also allow for a non word-character separator
like "::", as a best effort to avoid clashes.

-- 
Affijn, Ruud

"Gewoon is een tijger."

Named-capture regex syntax
user name
2006-12-26 09:25:59
On 26/12/06, Dr.Ruud <rvtol+newsisolution.nl> wrote:
> SADAHIRO Tomoyuki schreef:
>
> > +Currently NAME is restricted to simple identifiers only.
>
> I think it is important to also allow for a non word-character separator
> like "::", as a best effort to avoid clashes.

Since %+ is restricted to the current dynamic scope, as are $1, $2
etc., I don't see the need for this.

Named-capture regex syntax
user name
2006-12-26 17:30:32
"Rafael Garcia-Suarez" <rgarciasuarezgmail.com> wrote:
:On 26/12/06, Dr.Ruud <rvtol+newsisolution.nl> wrote:
:> SADAHIRO Tomoyuki schreef:
:>
:> > +Currently NAME is restricted to simple identifiers only.
:>
:> I think it is important to also allow for a non word-character separator
:> like "::", as a best effort to avoid clashes.
:
:Since %+ is restricted to the current dynamic scope, as are $1, $2
:etc., I don't see the need for this.

You may be constructing a regexp out of pieces, not all of which are
a) under your control, nor b) predictable. Alternatively you may be
supplying a fragment to someone else to incorporate in a larger regexp.

Hugo
Named-capture regex syntax
user name
2006-12-26 17:58:29
On 12/26/06, hvcrypt.org <hvcrypt.org> wrote:
> "Rafael Garcia-Suarez" <rgarciasuarezgmail.com> wrote:
> :On 26/12/06, Dr.Ruud <rvtol+newsisolution.nl> wrote:
> :> SADAHIRO Tomoyuki schreef:
> :>
> :> > +Currently NAME is restricted to simple identifiers only.
> :>
> :> I think it is important to also allow for a non word-character separator
> :> like "::", as a best effort to avoid clashes.
> :
> :Since %+ is restricted to the current dynamic scope, as are $1, $2
> :etc., I don't see the need for this.
>
> You may be constructing a regexp out of pieces, not all of which are
> a) under your control, nor b) predictable. Alternatively you may be
> supplying a fragment to someone else to incorporate in a larger regexp.

I don't see the problem with loosening the restrictions on the names,
so long as values of the form /^[+-]\d+$/ are not allowed and so long
as it is easy to find the end char. Having said that, I don't see the
original problem as that serious, as '_' is allowed.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
Named-capture regex syntax
user name
2006-12-27 13:16:43
On Tue, 26 Dec 2006 17:30:32 +0000, hvcrypt.org wrote

> "Rafael Garcia-Suarez" <rgarciasuarezgmail.com> wrote:
> :On 26/12/06, Dr.Ruud <rvtol+newsisolution.nl> wrote:
> :> SADAHIRO Tomoyuki schreef:
> :>
> :> > +Currently NAME is restricted to simple identifiers only.
> :>
> :> I think it is important to also allow for a non word-character separator
> :> like "::", as a best effort to avoid clashes.
> :
> :Since %+ is restricted to the current dynamic scope, as are $1, $2
> :etc., I don't see the need for this.
>
> You may be constructing a regexp out of pieces, not all of which are
> a) under your control, nor b) predictable. Alternatively you may be
> supplying a fragment to someone else to incorporate in a larger regexp.

1.  Of course its syntax, (?<name>pat), (?'name'pat), \k<name>
and \k'name', has room to accept other characters for names.
Namely, even if no metacharacter were allowed, a name in angle
brackets may be any sequence of characters but the greater-than sign, and
a single-quoted name may be any sequence of characters but the apostrophe.


2.  Advantages of the restriction to simple identifiers are
i)  to avoid misreading that the name would relate to the package
ii) to avoid mishandling unquoted hash key for %+

As shown below, complex identifiers like foo::bar aren't
safe when unquoted, and may cause the error:
   Bareword "foo::bar" not allowed while "strict subs"

#!perl
sub A__B { "abc" }
sub A::B { "xyz" }
my %hash = qw( A::B colon   A__B bar   abc 100   xyz 200 );
print "unquoted: A::B=$hash{A::B}, A__B=>$hash{A__B}\n";
print "quoted:   A::B=$hash{'A::B'}, A__B=>$hash{'A__B'}\n";
__END__
unquoted: A::B=200, A__B=>bar
quoted:   A::B=colon, A__B=>bar

Moreover, foo'bar, which cannot be accepted by (?'name'), should
be equivalent to foo::bar as an identifier, while the two should
differ as hash keys, $+{"foo::bar"} and $+{"foo'bar"}.
These may be confused.

There is no need to bind the name to the package.
If alphanumeric words don't provide enough namespace, more
arbitrary symbols should be available for the name.
But backward compatibility for look-behind assertions (?<=some>)
and (?<!some>) should be kept, as I had pointed out.


3.  To avoid namespace clashes, the length of the name shouldn't be
limited.  Whether or not :: is allowed, the naming rule should prevent
namespace clashes.  Certainly :: may bring forward the recommendation
that module authors should prefix names with the module name,
but there is no need for that if the module won't provide users with
fragment patterns, since names have already been localized:

#!perl
"bar" =~ /(?<name>bar)/;
print "1: $+{name}\n";
{
    "foo" =~ /(?<name>foo)/;
    print "2: $+{name}\n";
}
print "3: $+{name}\n";
__END__
1: bar
2: foo
3: bar


4.  When a fragment pattern is embedded in a base pattern,
they may share the same name for the capture.
Though any pattern having only \k<name> can't be compiled,
a pattern having only (?<name>) can be compiled.

Hence there is a case where the fragment pattern has (?<name>)
and the base pattern has \k<name>, say:

my $a = qr/(?<name>\w \w)/;
print "a b a c a c b" =~ /($a \k<name>)/ ? $1 : "not match";
__END__
a c a c

In another case, the fragment pattern may have both (?<name>)
and \k<name>. In both cases, the base pattern should be allowed
to have a \k<name> with the same name as one in the fragment pattern.

In contrast, the base pattern should *not* be allowed to
have a (?<name>) with the same name as one in the fragment pattern.
That would mean (?<name>) is redefined between the two patterns.

In the following example, if the name in $b were anything but foo,
both matches should succeed. But actually the redefinition of names
makes the latter match fail.

$b = qr/(?<foo>b)/;
print "a b a" =~ /((?<foo>a) $b \k<foo>)/ ? $1 : "not match", "\n";
print "b a a" =~ /($b (?<foo>a) \k<foo>)/ ? $1 : "not match", "\n";
__END__
a b a
not match

I think people writing /$b (?<foo>a) \k<foo>/ expect that
\k<foo> will refer to their own (?<foo>) and not to the other one in $b,
while people writing /$a \k<bar>/ expect $a to have (?<bar>).

Perhaps the redefinition of the same name in a single pattern should
cause a fatal error.

Regards,
SADAHIRO Tomoyuki


Named-capture regex syntax
user name
2006-12-27 14:04:59
On 12/27/06, SADAHIRO Tomoyuki <bqw10602nifty.com> wrote:
>
> On Tue, 26 Dec 2006 17:30:32 +0000, hvcrypt.org wrote
>
> > "Rafael Garcia-Suarez" <rgarciasuarezgmail.com> wrote:
> > :On 26/12/06, Dr.Ruud <rvtol+newsisolution.nl> wrote:
> > :> SADAHIRO Tomoyuki schreef:
> > :>
> > :> > +Currently NAME is restricted to simple identifiers only.
> > :>
> > :> I think it is important to also allow for a non word-character separator
> > :> like "::", as a best effort to avoid clashes.
> > :
> > :Since %+ is restricted to the current dynamic scope, as are $1, $2
> > :etc., I don't see the need for this.
> >
> > You may be constructing a regexp out of pieces, not all of which are
> > a) under your control, nor b) predictable. Alternatively you may be
> > supplying a fragment to someone else to incorporate in a larger regexp.
>
> 1.  Of course its syntax, (?<name>pat), (?'name'pat), \k<name>
> and \k'name', has room to accept other characters for names.
> Namely, even if no metacharacter were allowed, a name in angle
> brackets may be any sequence of characters but the greater-than sign, and
> a single-quoted name may be any sequence of characters but the apostrophe.

Well, they also can't look like numbers, and IMO the identifier should
be safely and unambiguously quoted regardless of whether it's being
declared or referenced by a (?<foo>...) (?'foo'...) \k<foo> \k'foo'
\k{foo} \g{foo}. I'm personally not so fussed about whether quotes are
required when doing $+{foo}.

I'd say something like
/^[^'<>{}=!\d[:cntrl:]+-][^'>}[:cntrl:]]+$/
would mostly cover it.
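
A minimal sketch of how the rule suggested above would classify a few names. The candidate names and the variables `$name_ok`, `@candidates` and `%verdict` are made up for illustration. Note that, as written, the pattern needs at least two characters, so a one-letter name is rejected.

```perl
use strict;
use warnings;

# The name rule suggested above, compiled once.
my $name_ok = qr/^[^'<>{}=!\d[:cntrl:]+-][^'>}[:cntrl:]]+$/;

# Hypothetical candidate names, chosen only for illustration.
my @candidates = ('foo', 'foo::bar', '1abc', '-1', "foo'bar", 'a');

# Record a verdict for each candidate.
my %verdict = map { $_ => ($_ =~ $name_ok ? 'ok' : 'rejected') } @candidates;

printf "%-10s %s\n", $_, $verdict{$_} for @candidates;
```

Under this rule "foo" and "foo::bar" pass, while names starting with a digit or sign, names containing an apostrophe, and single-character names do not.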

>
> 2.  Advantages of the restriction to simple identifiers are
> i)  to avoid misreading that the name would relate to the package

Well, for me the advantages are purely that it's simple to parse, and
unambiguous regardless of how it's expressed. I'm thinking that if
someone is mechanically generating patterns then they shouldn't need
to worry about quoting rules or ambiguity.
This is especially important with the special behaviour of \g{...}.

> ii) to avoid mishandling unquoted hash key for %+

Important I guess, but not very. People are IMO used to this problem.

> 4.  When a fragment pattern is embedded in a base pattern,
> they may share the same name for the capture.
> Though any pattern having only \k<name> can't be compiled,
> a pattern having only (?<name>) can be compiled.

Interesting point. Should \k<unknown> be compilable and treated as a
NOTHING regop?

It might make it easier to make pattern fragment libraries.
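
For context, a small sketch of the behaviour being questioned, as seen in released Perl (5.10 and later): a pattern that only references a name fails to compile. The variables `$src`, `$compiled` and `$error` are illustrative. The pattern is built from a string so that the failure happens at run time and can be trapped.

```perl
use strict;
use warnings;

# Build the pattern from a string so that it is compiled at run time
# and the failure can be trapped with eval.
my $src      = '\k<unknown>';
my $compiled = eval { qr/$src/ };
my $error    = $@;

print defined $compiled ? "compiled\n" : "failed: $error";
```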

> Hence there is a case where the fragment pattern has (?<name>)
> and the base pattern has \k<name>, say:
>
> my $a = qr/(?<name>\w \w)/;
> print "a b a c a c b" =~ /($a \k<name>)/ ? $1 : "not match";
> __END__
> a c a c
>
> In another case, the fragment pattern may have both (?<name>)
> and \k<name>. In both cases, the base pattern should be allowed
> to have a \k<name> with the same name as one in the fragment pattern.
>
> In contrast, the base pattern should *not* be allowed to
> have a (?<name>) with the same name as one in the fragment pattern.
> That would mean (?<name>) is redefined between the two patterns.

I don't really agree with this, actually. You have to distinguish
between two cases:

 / (?<foo>abc) (?<foo>def) /x

and

 / (?<foo>abc) | (?<foo>def) /x

In the first case there might be an issue, but in the second it
probably won't, and allowing the second is a lot more important than
catching the first. And since catching the first without blocking the
second is a pain in the ass, we end up with the current implementation.
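
A minimal sketch of the two cases, assuming the semantics of released Perl (5.10 and later), where $+{NAME} refers to the leftmost defined group with that name. The patch under discussion may have behaved differently. The test strings and the variables `$alt` and `$seq` are illustrative.

```perl
use strict;
use warnings;

# Alternation: only one of the two groups named foo takes part in the match.
my $alt = "def" =~ / (?<foo>abc) | (?<foo>def) /x ? $+{foo} : undef;

# Sequence: both groups match; %+ reports the leftmost defined one.
my $seq = "abcdef" =~ / (?<foo>abc) (?<foo>def) /x ? $+{foo} : undef;

print "alternation: $alt\n";
print "sequence:    $seq\n";
```

In the alternation case the shared name is harmless, because whichever branch matched supplies the value. In the sequence case the second group is hidden from %+.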

Additionally, when I did a talk on regexps at the LPW there was
interest in being able to have multiple buffers with the same name and
to be able to access them all. For instance, it was suggested that %-
could be set up to return an array reference with the values of the
buffers associated with that name, in the order they occur in the
pattern. I'm not sure how this would work in terms of named back refs,
but I could see how combined with relative backrefs the ability to do

   if (/..(?<name>...)...(?<name>...).../) {
      for my $names (@{$-{name}}) {
      }
   }

could be quite powerful. Supporting this is on my todo list, but doing
tied hashes in XS is a pain and I keep procrastinating. :-(
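
A sketch of that idea, assuming the %- hash as it exists in released Perl (5.10 and later), where each value is an array reference holding the captures of every group with that name, in pattern order. The date string and the group name `num` are made up for illustration.

```perl
use strict;
use warnings;

my @all;
if ("2006-12-27" =~ /(?<num>\d+)-(?<num>\d+)-(?<num>\d+)/) {
    # $-{num} is an array reference, one element per group named num.
    @all = @{ $-{num} };
}

print join(",", @all), "\n";
```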

> I think people writing /$b (?<foo>a) \k<foo>/ expect that
> \k<foo> will refer to their own (?<foo>) and not to the other one in $b,
> while people writing /$a \k<bar>/ expect $a to have (?<bar>).

I think these expectations would ultimately be too restrictive. If
they really expect that then they should do (??{$b}), which will have
the effect you describe.
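
A sketch contrasting the two ways of embedding a fragment, assuming released Perl (5.10 and later), where captures made inside (??{...}) are discarded once control returns to the main pattern. The names `$frag`, `$interp` and `$postponed` are illustrative. `$frag` stands in for `$b` from the earlier example.

```perl
use strict;
use warnings;

my $frag = qr/(?<foo>b)/;   # stands in for a fragment from someone else

# Interpolated: the fragment's foo is the leftmost one, so \k<foo> wants "b".
my $interp = "b a a" =~ /($frag (?<foo>a) \k<foo>)/ ? $1 : "not match";

# Postponed: the fragment's captures are discarded after the sub-match,
# so \k<foo> refers to the outer (?<foo>a).
my $postponed = "b a a" =~ /((??{ $frag }) (?<foo>a) \k<foo>)/ ? $1 : "not match";

print "interpolated: $interp\n";
print "postponed:    $postponed\n";
```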

I guess the issue is that the current way is powerful, but allows you to
blow your foot off if you aren't careful, but that's the price of power.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"