PetitParser Mystery

PetitParser Mystery

Sean P. DeNigris
Given:
        generationalPart := (#space asParser, generational) ==> #second.
        middleName := (#space asParser, (generational not, abbreviatableToken) ==> #second) ==> #second.
        lastName := (#space asParser, (generational not, token) ==> #second) ==> #second.
and
        input := 'John Smith Jr'.

The following parser fails:
        abbreviatableToken, middleName optional, lastName, generationalPart optional.

But this one succeeds:
        (abbreviatableToken, middleName, lastName, generationalPart optional) /
        (abbreviatableToken, nil asParser, lastName, generationalPart optional)

They look the same to me. What is the difference?

Thanks!



-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html

Re: PetitParser Mystery

Peter Kenny
Sean

I'm not an expert on PetitParser, but I think I understand what is
happening. If I am right, I would expect the parser which fails to also
fail, for the same reason, if the input is just 'John Smith' without the
'Jr'. If this is not so, you can disregard the rest of this post.

The top-level construct in your parser is PPSequenceParser, which works in a
simple-minded way; it just checks whether each of its component parsers
succeeds. If one of them fails, the whole sequence fails; it does not try
backtracking. (You can see the code at PPSequenceParser>>#parseOn:) In your
case, the second component parser, which is 'middleName optional', succeeds,
because 'Smith' could be a middle name. The next component, 'lastName',
fails because 'Jr' is not a valid last name, but there is no way for the
sequence parser to recall that the previous component had an optional
element. So the sequence fails.
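
A minimal sketch of this behaviour, for evaluation in a Pharo playground with PetitParser 1 loaded. The thread never shows how token, abbreviatableToken and generational are defined, so the three definitions at the top are hypothetical stand-ins; only the remaining rules come from the original post.

```smalltalk
| token abbreviatableToken generational generationalPart middleName lastName failing |

"Hypothetical stand-ins: these three rules are NOT shown in the thread."
token := #letter asParser plus flatten.
abbreviatableToken := token.    "assumption: abbreviated forms such as 'J.' are ignored here"
generational := 'Jr' asParser / 'Sr' asParser.

"Rules as given in the original post."
generationalPart := (#space asParser, generational) ==> #second.
middleName := (#space asParser, (generational not, abbreviatableToken) ==> #second) ==> #second.
lastName := (#space asParser, (generational not, token) ==> #second) ==> #second.

"The parser that fails."
failing := abbreviatableToken, middleName optional, lastName, generationalPart optional.

"'middleName optional' consumes ' Smith'; lastName then sees ' Jr' and is
 rejected by 'generational not'. The sequence does not retry without the
 middle name, so the whole parse fails."
(failing parse: 'John Smith Jr') isPetitFailure.      "true"

"Same problem without the suffix: ' Smith' is taken as the middle name,
 and nothing is left for lastName."
(failing parse: 'John Smith') isPetitFailure.         "true"

"With three plain names the greedy middle name happens to be right."
(failing parse: 'John Quincy Adams') isPetitFailure.  "false"
```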

The only way to cope with this that I can see is to make the options
explicit by using the slash, which does show the parser where to backtrack
to. This is what your second parser does. You could limit the scope of the
backtracking to avoid re-parsing the first name, by writing something like:

        firstName, ((middleName, lastName) / lastName), generational optional

(I'm not sure whether the innermost parentheses are necessary, but at least
they do no harm.)
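
On the parentheses: Smalltalk evaluates binary selectors such as , and / strictly left to right, so the innermost pair is not required, while the outer pair around the whole choice is. The sketch below tries the scoped choice in a Pharo playground with PetitParser 1. As assumptions of this sketch, token, abbreviatableToken and generational are hypothetical stand-ins (the thread does not define them), abbreviatableToken plays the role of firstName, and generationalPart is used in the last position because it is the rule that consumes the leading space.

```smalltalk
| token abbreviatableToken generational generationalPart middleName lastName scoped |

"Hypothetical stand-ins: these three rules are NOT shown in the thread."
token := #letter asParser plus flatten.
abbreviatableToken := token.
generational := 'Jr' asParser / 'Sr' asParser.

"Rules as given in the original post."
generationalPart := (#space asParser, generational) ==> #second.
middleName := (#space asParser, (generational not, abbreviatableToken) ==> #second) ==> #second.
lastName := (#space asParser, (generational not, token) ==> #second) ==> #second.

"The ordered choice marks the point to return to: if 'middleName, lastName'
 fails, the input is rewound to just after the first name and 'lastName'
 alone is tried. 'end' additionally requires the whole input to be consumed."
scoped := (abbreviatableToken, ((middleName, lastName) / lastName), generationalPart optional) end.

(scoped parse: 'John Smith Jr') isPetitFailure.       "false"
(scoped parse: 'John Smith') isPetitFailure.          "false"
(scoped parse: 'John Quincy Adams') isPetitFailure.   "false"
```

Note that the two branches of the choice return differently shaped results (a two-element collection for the first, a single string for the second), so any code consuming the parse result has to allow for both.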

Thinking about this, I wondered how 'optional' could ever be used except at
the end of a sequence. I think the answer is that it works if the optional
token has a format or structure which identifies it uniquely if it does
occur; in this case, the effect of 'optional' is to say 'forget it if it
doesn't occur'. In your case, there is nothing to distinguish a middle name
from a last name; indeed, I believe in US usage they can be the same - if
Jane Smith marries John Doe, can she become Jane Smith Doe?
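
To illustrate the case where 'optional' does work in the middle of a sequence, here is a small sketch for a Pharo playground with PetitParser 1. The rules are invented for this example and are not Sean's grammar: the hypothetical middleInitial must be a capital letter followed by a full stop, which a plain last name can never match.

```smalltalk
| token middleInitial lastName name |

"All rules here are hypothetical, made up for this illustration."
token := #letter asParser plus flatten.
middleInitial := (#space asParser, (#uppercase asParser, $. asParser) flatten) ==> #second.
lastName := (#space asParser, token) ==> #second.

"The optional element sits in the middle of the sequence, yet no explicit
 choice is needed: on ' Adams' middleInitial fails at the missing full stop,
 the input is rewound, 'optional' yields nil, and lastName parses ' Adams'."
name := (token, middleInitial optional, lastName) end.

(name parse: 'John Q. Adams') isPetitFailure.   "false"
(name parse: 'John Adams') isPetitFailure.      "false"
```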

If you are going to produce a parser which copes with all the vagaries of
people's names, especially outside the US, I think you will have some fun.
Many people in France would write a surname like yours with a space after
the 'De', and probably a lower-case 'd' as well. In Scotland, a suffix like
'Jr' could appear as 'the Younger'. Some people have more than two
forenames. Some people have double-barrelled surnames, with or without a
hyphen. Those are just a few of the complications I can think of. So good
luck!

Hope this helps

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of
Sean P. DeNigris
Sent: 21 September 2017 03:18
To: [hidden email]
Subject: [Pharo-users] PetitParser Mystery



Re: PetitParser Mystery

Sean P. DeNigris
Peter Kenny wrote
> I would expect the parser which fails to also
> fail, for the same reason, if the input is just 'John Smith' without the
> 'Jr'.

Correct! It did.


Peter Kenny wrote
> The top-level construct in your parser is PPSequenceParser, which works in
> a simple-minded way; it just checks whether each of its component parsers
> succeeds. If one of them fails, the whole sequence fails; it does not try
> backtracking.

Ah, okay. That makes sense. It seems I just got lucky: when I've done this
before, it was the special case you mention of uniquely identifiable tokens.


Peter Kenny wrote
> firstName, ((middleName, lastName)/ lastName), generational optional

That worked for all interesting cases.


Peter Kenny wrote
> If you are going to produce a parser which copes with all the vagaries of
> people's names, especially outside the US, I think you will have some fun.

Ha ha, no doubt. I am not presuming to capture the whole domain, just an
interesting - and thankfully very limited - subset!

Thanks for all the help :)



-----
Cheers,
Sean