|
List Info
Thread: Am implenting backtracking for the parser-generator
|
|
| Am implenting backtracking for the
parser-generator |

|
2007-03-02 18:17:46 |
I have just started implementing backtracking for the parser
and would
like oppinions on how to go ahead with this.
I have just reached a state where the first simple grammar
compiles
parses stuff.
The idea:
The parser generator only looks one token forwards
(lalr(1)). This is
sometimes a bit annoying.
Instead of implementing a more general parser, the idea of
backtracking
is, that if you need to look more than 1 token ahead, you
simply make a
clever guess, and if the guess was wrong you backtrack to
the point, and
try something else.
Notice, that you only backtrack if your guess leads to a
compiler error,
so backtracking is NO MEANS of disambiguing a grammar. It is
a way of
implementing LALR(n), where n>1.
Backtracking (in my implementation, at least) works in the
following way:
Rules can be marked as "backtrackable". When the
parser generator meets
a conflict, and one of the involved rules is backtrackable,
that rule is
used, and is associated with a pointer to the other rule.
When the generated parser reduces a production that has a
backtrack-production associated, it pushes a
"backtracking mark" onto a
special "backtracking stack", and every symbol
processed from now on is
recorded on that backtracking mark.
If the parser meets an error, instead of reporting the
error, if the
backtracking stack is nonempty, it pops the top mark from
the
backtracking stack and reverts the state to the state when
the mark was
pushed onto the stack.
Instead of redoing the same production that lead to the mark
being
pushed onto the stack, it does the associated
backtrack-production, and
then it reprocesses all the symbols that was processed
between the mark
and the error. (These might lead to new elements being
pushed onto the
backtrack stack).
To avoid the possibility of a small error leading to a
backtrack right
back to the start of the program, productions can also be
marked as
"Stopping backtracks". These stops are safe
points, meaning, that if the
parser manages to reduce this production then everything up
to this
point is ok, and THE BACKTRACK STACK IS CLEARED.
This is basically it.
Because great care has been taken on keeping the parser
fast, and
backtracking ruins all this by going back and repeating
stuff,
backtracking should be kept at a minimum.
To reflect this, I have the following idea of how the syntax
for
backtracking should be:
Today definition of all rules is terminated by a full stop
(.). My
suggestion: A full stop means: Stop all backtracking (clear
backtrack
stack). Don't allow new backtrack on this production. This
way the
default is not to allow backtracking.
If you want to allow old backtracking history to keep on
going, but
won't allow this rule to start a new backtracking point,
this is marked
by replacing the full stop with a comma (,).
To allow old backtrackings to continue, and also add a new
backtrack
point for this rule, the rule is terminated by semicolon
(;).
Finally, to stop old backtrackings, but at the same time
inserting a new
backtrack point on the stack, the rule is terminated by
colon ( .
This is sort of intuitive. The lower part of the terminating
symbol (.
or ,) states whether old backtracks should be allowed, and
the top part
of it (. or nothing) states whether a new backtrack symbol
should be
allowed.
I have reached the "Hello-world" sortof state.
Haven't done much bug
testing, but simple grammars can be processed, the relevant
productions
are marked in the generated parser, and afaics the generated
parser
seems to work correctly.
P.t. I don't handle destructor code, because I could not
really
understand it.
Also I don't understand why the call to
"is_expected_token" was added to
the innermost loop of the parser. It was not in the
c-version, it looks
really expensive, and I don't really understand the code in
is_expected_token. So I commented it out.
Comments?
Should I post the whole thing onto this list?
Please cc me as I am not subscribed.
-Rune
--
PEAR Development Mailing List (http://pear.php.net/)
To unsubscribe, visit: http://www.php.net/unsub
.php
|
|
| Re: Am implenting backtracking for the
parser-generator |

|
2007-03-03 23:05:18 |
Hello Rune,
Rune Zedeler wrote:
> I have just started implementing backtracking for the
parser and would
> like oppinions on how to go ahead with this.
> I have just reached a state where the first simple
grammar compiles
> parses stuff.
>
> The idea:
> The parser generator only looks one token forwards
(lalr(1)). This is
> sometimes a bit annoying.
> Instead of implementing a more general parser, the idea
of backtracking
> is, that if you need to look more than 1 token ahead,
you simply make a
> clever guess, and if the guess was wrong you backtrack
to the point, and
> try something else.
> Notice, that you only backtrack if your guess leads to
a compiler error,
> so backtracking is NO MEANS of disambiguing a grammar.
It is a way of
> implementing LALR(n), where n>1.
>
> Backtracking (in my implementation, at least) works in
the following way:
> Rules can be marked as "backtrackable". When
the parser generator meets
> a conflict, and one of the involved rules is
backtrackable, that rule is
> used, and is associated with a pointer to the other
rule.
> When the generated parser reduces a production that has
a
> backtrack-production associated, it pushes a
"backtracking mark" onto a
> special "backtracking stack", and every
symbol processed from now on is
> recorded on that backtracking mark.
> If the parser meets an error, instead of reporting the
error, if the
> backtracking stack is nonempty, it pops the top mark
from the
> backtracking stack and reverts the state to the state
when the mark was
> pushed onto the stack.
> Instead of redoing the same production that lead to the
mark being
> pushed onto the stack, it does the associated
backtrack-production, and
> then it reprocesses all the symbols that was processed
between the mark
> and the error. (These might lead to new elements being
pushed onto the
> backtrack stack).
>
> To avoid the possibility of a small error leading to a
backtrack right
> back to the start of the program, productions can also
be marked as
> "Stopping backtracks". These stops are safe
points, meaning, that if the
> parser manages to reduce this production then
everything up to this
> point is ok, and THE BACKTRACK STACK IS CLEARED.
>
> This is basically it.
>
> Because great care has been taken on keeping the parser
fast, and
> backtracking ruins all this by going back and repeating
stuff,
> backtracking should be kept at a minimum.
> To reflect this, I have the following idea of how the
syntax for
> backtracking should be:
> Today definition of all rules is terminated by a full
stop (.). My
> suggestion: A full stop means: Stop all backtracking
(clear backtrack
> stack). Don't allow new backtrack on this production.
This way the
> default is not to allow backtracking.
> If you want to allow old backtracking history to keep
on going, but
> won't allow this rule to start a new backtracking
point, this is marked
> by replacing the full stop with a comma (,).
> To allow old backtrackings to continue, and also add a
new backtrack
> point for this rule, the rule is terminated by
semicolon (;).
> Finally, to stop old backtrackings, but at the same
time inserting a new
> backtrack point on the stack, the rule is terminated by
colon ( .
>
> This is sort of intuitive. The lower part of the
terminating symbol (.
> or ,) states whether old backtracks should be allowed,
and the top part
> of it (. or nothing) states whether a new backtrack
symbol should be
> allowed.
>
> I have reached the "Hello-world" sortof
state. Haven't done much bug
> testing, but simple grammars can be processed, the
relevant productions
> are marked in the generated parser, and afaics the
generated parser
> seems to work correctly.
>
> P.t. I don't handle destructor code, because I could
not really
> understand it.
> Also I don't understand why the call to
"is_expected_token" was added to
> the innermost loop of the parser. It was not in the
c-version, it looks
> really expensive, and I don't really understand the
code in
> is_expected_token. So I commented it out.
>
> Comments?
> Should I post the whole thing onto this list?
>
> Please cc me as I am not subscribed.
The best approach will be to open a feature request at
http://pear.php.net/bugs/report.php?package=PHP_P
arserGenerator
Then, after pear.php.net is updated in the next few days to
allow
patches to be added to the feature request, upload your
patch.
However, please don't comment out is_expected_token(). This
is
extremely important code that allows a syntax error to be
thrown at the
moment it occurs. It's not as expensive as you might think,
as expected
token list lookups are O(1) and the worst case is only
reached when
there really is a syntax error. The most common case is a
single call
to is_expected_token() for the majority of lookahead
tokens.
There were cases where invalid grammar was slipping right by
the
generated parsers, with no errors, and it has to do with the
weird way
Lemon was designed. Perhaps it worked in the C version, but
it most
certainly didn't in the PHP version.
My basic rule of thumb: don't comment out code just because
you don't
understand it
However, if you can demonstrate that your feature does not
add bloat and
is easy to test for correctness, I would definitely consider
adding it.
Also, do you have a sample grammar that demonstrates the
need for the
feature? It may be possible to optimize out the need for
backtracking
by simply performing the proper check for invalid grammar in
the action
code, which is much more efficient (no parser looping
necessary).
Thanks,
Greg
--
PEAR Development Mailing List (http://pear.php.net/)
To unsubscribe, visit: http://www.php.net/unsub
.php
|
|
| Re: Am implenting backtracking for the
parser-generator |

|
2007-03-04 13:02:13 |
Greg Beaver wrote:
> The best approach will be to open a feature request at
> http://pear.php.net/bugs/report.php?package=PHP_P
arserGenerator
Ok.
> However, please don't comment out is_expected_token().
This is
> extremely important code that allows a syntax error to
be thrown at the
> moment it occurs.
Ok. Can you post a simple grammar where it goes wrong when
it is
commented out?
> My basic rule of thumb: don't comment out code just
because you don't
> understand it
Of course. I didn't plan to leave it commented out
permanently.
I was just afraid of ruining something that I didn't really
understand,
so I commented it out instead of trying to get it to sort-of
work with
the backtracker without really unrestanding it.
> However, if you can demonstrate that your feature does
not add bloat and
> is easy to test for correctness, I would definitely
consider adding it.
Okay, great.
I will need some time to clean up my code.
I will also have to think about a good short example that
illustrates
the need.
All reasonable uses of backtracking will be LALR(n) for some
definite n,
and it can be shown that all LALR(n) languages are also
LALR(1). So the
backtracker does only allow for simpler and more straigt
forwards
grammars. It does not add languages that could not be parsed
before.
Oh, and the destructor code...
I could see no system in when the destructors are called. It
seems that
they are not called on the rhs-symbols when reducing a
rule.
Will constructors ever be used for anything in php? I would
vote for
removing the destructor code. Iirc, PHP already has its own
destructor
system, allowing for a destructor to be called when gc
erases an object.
afaics no need to put another system on top of that.
-Rune
--
PEAR Development Mailing List (http://pear.php.net/)
To unsubscribe, visit: http://www.php.net/unsub
.php
|
|
| Re: Am implenting backtracking for the
parser-generator |

|
2007-03-04 13:28:03 |
Greg Beaver wrote:
Here is a very simple grammar, where backtracking is really
nice.
Making a LALR(1)-grammar that accepts the same language will
basically
require you to merge the two kind of separators to one, and
make a
"date_or_interval" production that accepts from 2
to 6 numbers separated
by separators. Afterwards you would have to discard the
strings where
the separators were wrong.
Coming up with further extensions for the language, that
ruins this
approach and leeds to even more messy grammars is easy. For
instance one
could add a number-interval (NUMBER TO NUMBER). This would
not interfere
with the dates in the grammar as given here, because the TO
token cannot
appear in the outermost numbers. But it would interfere with
the
genreralised grammar outlines above.
%name SimpleExp_
%declare_class {class SimpleExp_yyParser}
%include {
$parser = new SimpleExp_yyParser;
$tokens['/'] = SimpleExp_yyParser::SLASH;
$tokens['-'] = SimpleExp_yyParser: ASH;
$tokens['1'] = SimpleExp_yyParser::NUMBER;
$tokens['t'] = SimpleExp_yyParser::TO;
$EOP = SimpleExp_yyParser::EOP;
$str = "1-1-1-1";
$parser->doParse($EOP,0);
for($i=0; $i< strlen($str); $i++) {
$token = $tokens[$str{$i}];
$parser->doParse($token,0);
}
$parser->doParse($EOP,0);
$parser->doParse(0,0);
}
%syntax_error {
foreach ($this->yy_get_expected_tokens($yymajor) as
$token) {
$expect[] = self::$yyTokenName[$token];
}
throw new Exception('Unexpected ' .
$this->tokenName($yymajor) .
'(' . $TOKEN
. '), expected one of: ' . implode(',', $expect));
}
%parse_accept {
echo "Accept!n";
}
start ::= EOP interval EOP.
start ::= EOP date EOP.
interval ::= date to date.
date ::= NUMBER datesep NUMBER.
date ::= NUMBER datesep NUMBER datesep NUMBER.
datesep ::= DASH.
datesep ::= SLASH.
to ::= DASH.
to ::= TO.
--
PEAR Development Mailing List (http://pear.php.net/)
To unsubscribe, visit: http://www.php.net/unsub
.php
|
|
[1-4]
|
|