Given:

    generationalPart := (#space asParser, generational) ==> #second.
    middleName := (#space asParser, (generational not, abbreviatableToken) ==> #second) ==> #second.
    lastName := (#space asParser, (generational not, token) ==> #second) ==> #second.

and

    input := 'John Smith Jr'.

The following parser fails:

    abbreviatableToken, middleName optional, lastName, generationalPart optional.

But this one succeeds:

    (abbreviatableToken, middleName, lastName, generationalPart optional)
        / (abbreviatableToken, nil asParser, lastName, generationalPart optional)

They look the same to me. What is the difference? Thanks!

-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
Sean
I'm not an expert on PetitParser, but I think I understand what is happening. If I am right, I would expect the parser which fails to also fail, for the same reason, if the input is just 'John Smith' without the 'Jr'. If this is not so, you can disregard the rest of this post.

The top-level construct in your parser is PPSequenceParser, which works in a simple-minded way; it just checks whether each of its component parsers succeeds. If one of them fails, the whole sequence fails; it does not try backtracking. (You can see the code at PPSequenceParser>>#parseOn:.)

In your case, the second component parser, which is 'middleName optional', succeeds, because 'Smith' could be a middle name. The next component, 'lastName', fails because 'Jr' is not a valid last name, but there is no way for the sequence parser to recall that the previous component had an optional element. So the sequence fails.

The only way to cope with this that I can see is to make the options explicit by using the slash, which does show the parser where to backtrack to. This is what your second parser does. You could limit the scope of the backtracking to avoid re-parsing the first name, by writing something like:

    firstName, ((middleName, lastName)/ lastName), generational optional

(I'm not sure whether the innermost parentheses are necessary, but at least they do no harm.)

Thinking about this, I wondered how 'optional' could ever be used except at the end of a sequence. I think the answer is that it works if the optional token has a format or structure which identifies it uniquely if it does occur; in this case, the effect of 'optional' is to say 'forget it if it doesn't occur'. In your case, there is nothing to distinguish a middle name from a last name; indeed, I believe in US usage they can be the same - if Jane Smith marries John Doe, can she become Jane Smith Doe?
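The difference between the two grammars can be reproduced in a Pharo Playground with something like the sketch below. It assumes PetitParser is loaded. The definitions of token, abbreviatableToken and generational are not shown anywhere in this thread, so the ones here are hypothetical stand-ins, and the last grammar uses generationalPart (with its leading space) where the suggestion above writes generational.

```smalltalk
"Sketch for a Pharo Playground with PetitParser loaded.
 token, abbreviatableToken and generational are hypothetical stand-ins;
 the thread never shows their real definitions."
| token abbreviatableToken generational generationalPart middleName lastName failing working |
token := #letter asParser plus flatten.
abbreviatableToken := token.
generational := 'Jr' asParser / 'Sr' asParser.

"Definitions as given in the original question."
generationalPart := (#space asParser, generational) ==> #second.
middleName := (#space asParser, (generational not, abbreviatableToken) ==> #second) ==> #second.
lastName := (#space asParser, (generational not, token) ==> #second) ==> #second.

"Original grammar: the optional middleName consumes ' Smith', then lastName
 rejects ' Jr', and the sequence fails without retrying the optional."
failing := abbreviatableToken, middleName optional, lastName, generationalPart optional.

"Suggested grammar: the ordered choice gives the parser a point to fall
 back to, so lastName alone is tried when (middleName, lastName) fails."
working := abbreviatableToken, ((middleName, lastName) / lastName), generationalPart optional.

(failing parse: 'John Smith Jr') isPetitFailure.    "expected: true"
(failing parse: 'John Smith') isPetitFailure.       "expected: true"
(working parse: 'John Smith Jr') isPetitFailure.    "expected: false"
(working parse: 'John Q Smith Jr') isPetitFailure.  "expected: false"
```

Evaluate the last four lines one at a time (or inspect the parse results directly) to see which grammar accepts which input.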
If you are going to produce a parser which copes with all the vagaries of people's names, especially outside the US, I think you will have some fun. Many people in France would write a surname like yours with a space after the 'De', and probably a lower-case 'd' as well. In Scotland, a suffix like 'Jr' could appear as 'the Younger'. Some people have more than two forenames. Some people have double-barrelled surnames, with or without a hyphen. Those are just a few of the complications I can think of. So good luck!

Hope this helps

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of Sean P. DeNigris
Sent: 21 September 2017 03:18
To: [hidden email]
Subject: [Pharo-users] PetitParser Mystery
Peter Kenny wrote
> I would expect the parser which fails to also fail, for the same reason,
> if the input is just 'John Smith' without the 'Jr'.

Correct! It did.

Peter Kenny wrote
> The top-level construct in your parser is PPSequenceParser, which works in
> a simple-minded way; it just checks whether each of its component parsers
> succeeds. If one of them fails, the whole sequence fails; it does not try
> backtracking.

Ah, okay. That makes sense. It seems I just got lucky in that when I've done this before, it was the special case you mention below of unique tokens.

Peter Kenny wrote
> firstName, ((middleName, lastName)/ lastName), generational optional

That worked for all interesting cases.

Peter Kenny wrote
> If you are going to produce a parser which copes with all the vagaries of
> people's names, especially outside the US, I think you will have some fun.

Ha ha, no doubt. I am not presuming to capture the whole domain, just an interesting - and thankfully very limited - subset!

Thanks for all the help :)

-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
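
For contrast, the "unique tokens" special case mentioned above might look like the following sketch, again assuming PetitParser is loaded in a Pharo Playground. The grammar is hypothetical and not taken from anyone's code in this thread: the optional element is a middle initial that must end in a period, so it cannot accidentally consume the last name.

```smalltalk
"Hypothetical grammar: the optional middle initial is recognisable by its
 trailing period, so 'optional' in mid-sequence is safe here."
| first middleInitial last name |
first := #letter asParser plus flatten.
middleInitial := (#space asParser, (#letter asParser, $. asParser) flatten) ==> #second.
last := (#space asParser, #letter asParser plus flatten) ==> #second.
name := first, middleInitial optional, last.

(name parse: 'John Q. Smith') isPetitFailure.  "expected: false"
(name parse: 'John Smith') isPetitFailure.     "expected: false"
```

With 'John Smith', middleInitial fails at the missing period and consumes nothing, so the optional yields nil and last still sees ' Smith'.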