Thank you Martin,
your explanations got me some steps further. I found the page http://pdos.csail.mit.edu/~baford/packrat/ by Bryan Ford which seems to be a good entry point for reading about PEG parsing. That page refers to http://en.wikipedia.org/wiki/Parsing_expression_grammar as a good technical overview. I took the wiki example from http://code.google.com/p/xtreams/wiki/Parsing and adapted the names so that it fits the Squeak implementation A Parser is created from a PEG grammar by parsing the grammar rules using the ParserParser. wikiGrammar := PEGParser grammarWiki reading. wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar actor: PEGParserParser new. input := 'Single paragraph with *bold* and _italic_ text and a [link]' reading. "As you indicated I used the PEGTreeBuilder" wikiParser parse: 'Page' stream: input actor: PEGTreeBuilder new If I explore the resulting object I got what is shown in the screen shot. I am not sure how to interpret the result. "Regarding the wiki example: the class PEGWikiGenerator is not included" wikiParser parse: 'Page' stream: input actor: PEGWikiGenerator new. Where can I get this class? The comment of the method #parse:stream:actor: of class PEGParser is helpful "Parse @definition from aStream using @anActor to act on the matching rules. "" definition <String> identifies the rule of the grammar to apply to the input aStream <ReadStream> the input to parse anActor <Actor> receives callbacks from the parser for the successfully rules with the matching the input ^ <Object> result of the actor action for the @definition " As class methods of PEGParser there are 8 grammars included - CSS3, HttpUrl, JSON, JavaScript, PEG, Smalltalk, Wiki, XML. I think a simple complete example which actually does something would be helpful for me. TreeBuilder seems to do nothing and I do not see the tree. For example besides the classes PEGParserPEGTest and PEGParserSmalltalkTest (6 test cases) a class PEGParserHttpUrlTest or PEGParserJavaScriptTest would be helpful. Actually I would like to parse some simple Javascript like mySent = new aswExample('') mySent.addWord('Ninatoka') mySent.currentWord().firstLetterUppercase=true mySent.currentWord().addMorpheme('ni-','S1s-') mySent.currentWord().addMorpheme('na-','Pres.-') mySent.currentWord().addMorpheme('tok','come_from') mySent.currentWord().addMorpheme('-a','-VF') mySent.addWord('Uswisi.') mySent.currentWord().addMorpheme('Uswisi','Switzerland.') mySent.writeHTML() It is used for web pages on which Swahili text is displayed. On request a morphological analysis is shown. A note: this example could be parsed with some manually constructed code without PEG I assume, just string matching and some branching. So there is no pressure on this. But I'd like to use it to learn PEG parsing. The goal is to translate the JavaScript code to Smalltalk code (and later one to an XML representation for printing). --Hannes On 1/21/11, [hidden email] <[hidden email]> wrote: > "Hannes Hirzel"<[hidden email]> wrote: >> Besides http://code.google.com/p/xtreams/wiki/Parsing are there more >> examples how to use Peg parsing? > > Unfortunately that one is still waiting for documentation. Even the little > bit that is on the wiki page uses an example that wasn't ported because it > uses VW's XML Elements. > > It's actually not that complicated to use. Basically you take a PEG grammar, > you parse it as the wiki example shows (the first 2 lines). Then you need an > 'actor' that will get callbacks from the parser when it parses some input. > The Actor class provides simple pragma based dispatch mechanism for > convenience, but ultimately the actor simply receives > #process:object:start:stop: callbacks from successfully applied grammar > rules. The TreeBuilder in the Tests package shows the simplest Actor that > just collects all the callback arguments. The return value of the callback > is important, the object: argument of the callback is a collection of what > the lower level rules returned from their callbacks. Anyway that's real > quick description of how the parsing mechanism works from user point of > view. Real documentation needs a lot of simple examples to make that > clearer, but hopefully this could get you going if you experiment with > simpler grammars and the TreeBuilder a bit. I'll try to answer any more > specific questions as I can. > > HTH, > > Martin > ExploreResultOfPEGwikiParser.PNG (55K) Download Attachment |
"Hannes Hirzel"<[hidden email]> wrote:
> If I explore the resulting object I got what is shown in the screen > shot. I am not sure how to interpret the result. Yes, it's not the best possible representation, but it does show the full parse tree. The tree can be rather complicated depending on how complicated is the grammar. That's why I suggested playing with simpler grammars. But the reason why I even mentioned the TreeBuilder is that it's the simplest possible Actor, it implements single method that packages the result of each successful production as it's own node. Maybe using an array instead of associations would create a slightly more "tree like" result. > "Regarding the wiki example: the class PEGWikiGenerator is not included" > > wikiParser parse: 'Page' stream: input actor: PEGWikiGenerator new. > > Where can I get this class? I've attached a fileout of the class that you can file into Squeak, except it references VW XML classes, so it won't work. If you fix it up to use the Squeak XML stuff you should be able to get it to work. All you really need is XML.Element and XML.Tag equivalent. The second fileout shows how the class variables are initialized in VW. You might need to turn the initializers into equivalent class side initialize methods. > As class methods of PEGParser there are 8 grammars included - CSS3, > HttpUrl, JSON, JavaScript, PEG, Smalltalk, Wiki, XML. > > I think a simple complete example which actually does something would > be helpful for me. TreeBuilder seems to do nothing and I do not see the tree. OK, let's try something really simple. The wikipedia page shows a simple arithmetic expression grammar. Here it is transcribed for Xtreams: grammar := ' Number <- [0-9]+ Parenthesised <- "(" Expr ")" Value <- Number / Parenthesised Product <- Value (("*" / "/") Value)* Sum <- Product (("+" / "-") Product)* Expr <- Sum'. I expanded the Value from the wiki grammar into separate Number and Parethesised expression productions. It'll hopefully be clear why shortly. You already know how to get a parser from grammar. That's always the same and for anything more complex you'll probably want to cache the parser instead of recreating it every time, but this will do fine for now: parser := PEG.Parser parserPEG parse: 'Grammar' stream: grammar actor: PEG.ParserParser new. With this you can already parse expressions (note that the grammar as it is written cannot handle whitespace) parser parse: 'Expr' stream: '3+4*(10-4)' reading actor: nil you'll get this (roughly formatted to provide at least some resemblance of the expression tree): #( #(OrderedCollection ($3 "16r0033") OrderedCollection ()) OrderedCollection (# ('+' #(OrderedCollection ($4 "16r0034") OrderedCollection (# ('*' #( '(' #( #(OrderedCollection ($1 "16r0031" $0 "16r0030") OrderedCollection ()) OrderedCollection (# ('-' #(OrderedCollection ($4 "16r0034") OrderedCollection ())))) ')'))))))) That's obviously not very easy to take apart to work with, that's why you want an Actor that will receive rule specific callbacks and do whatever. Let's say we want to evaluate the expression and return the result. We'll make a subclass of Actor called Evaluator. As I mentioned earlier, the Actor class provides pragma based mechanism for expressing rule specific callbacks so you simply write a method for each rule for which you want to do something and annotate it with a pragma. Let's first try to turn the digit collections into numbers: Evaluator>>Number: digits <action: 'Number'> ^digits inject: 0 into: [ :total :digit | total * 10 + ('0123456789' indexOf: digit) - 1 ] With this you can parse numbers parser parse: 'Expr' stream: '3456' reading actor: Evaluator new Now let's try to add. We'll want a method for the Sum production but there's a slightly tricky part there. Note that the rule is defined as Sum <- Product (("+" / "-") Product)* which means that you get a term (Product) and then zero or more pairs of + or - and another term. What the method receives will be collection of the first term and then pairs of the operator with its term. So parsing this: parser parse: 'Expr' stream: '3+4+5+6+7' reading actor: Evaluator new will give you #(3 OrderedCollection (#('+' 4) #('+' 5) #('+' 6) #('+' 7))) So if you add a method for Sum, that's what it will need to deal with: Sum: terms <action: 'Sum'> ^terms last inject: terms first into: [ :total :next | (next first = '+') ifTrue: [ total + next last ] ifFalse: [ total - next last ] ] The Actor pragmas provide another convenience for taking more complex input apart. You can write the above as Sum: first rest: pairs <action: 'Sum' arguments: #(1 2)> ^pairs inject: first into: [ :total :pair | (pair first = '+') ifTrue: [ total + pair last ] ifFalse: [ total - pair last ] ] The arguments: keyword simply says invoke this method with elements 1 and 2 of the rule input (terms) as arguments. Obviously the method must take the corresponding number of arguments as well. So with the above rules the addition expression should give you 25. The Product rule handling is analogous: Product: first rest: pairs <action: 'Product' arguments: #(1 2)> ^pairs inject: first into: [ :total :pair | (pair first = '*') ifTrue: [ total * pair last ] ifFalse: [ total / pair last ] ] Finally the parenthesised expression is rather simple: Parenthesised: expression <action: 'Parenthesised' arguments: #(2)> ^expression It receives a three element array: the left bracket, the expression inside and the right bracket. All we really need to do is return the expression in the middle. The above is using the arguments: pragma again to achieve that, the following would be equivalent: Parenthesised: expression <action: 'Parenthesised'> ^expression at: 2 I hope by now it's also clearer why it was useful to factor the Value definition out into separate Number and Parenthesised rule. Otherwise we'd have to handle both cases in a single Value: method, which would make the code messier. > For example besides the classes PEGParserPEGTest and > PEGParserSmalltalkTest (6 test cases) a class PEGParserHttpUrlTest or > PEGParserJavaScriptTest would be helpful. Yup, the tests are seriously wanting as well. Hopefully we'll get there soon. BTW, the parsing stuff doesn't work for me out of the box in either Squeak or Pharo in the images that I use (which should be quite clean). The problem I'm seeing is that the Dictionary>>addAll: in ParserParser>>Grammar: doesn't work as expected. It's trying to add collection of associations to a dictionary but ends up adding them as values indexed by integers. I can fix it but I wonder what do people have loaded that makes it work for them. Anyway, hopefully the above can get you going. Cheers, Martin |
On 1/23/11, [hidden email] <[hidden email]> wrote:
> OK, let's try something really simple. The wikipedia page shows a simple > arithmetic expression grammar. Hello Martin Thank you for working out the Wikipedia simple arithmetic expression grammar into a tutorial style text. I worked through it and understood it. I adapted the class names so that it runs in the current Squeak 4.2 out of the box. In addition I edited it slightly. I think it could be added as to the PEGParser class. Maybe somebody else like to check it in addition .... Additional example: Wiki text parsing In the previous mail you mentioned that you'll add the fileout of the class PEGWikiGenerator I did not get the fileout. --Hannes ============================================= PEGParser example Parsing simple arithmetic expressions The following may be copied into a workspace ============================================= "The wikipedia page shows a simple arithmetic expression grammar. Here it is transcribed for Xtreams:" simpleArithmeticExpressionGrammarDescription := ' Number <- [0-9]+ Parenthesised <- "(" Expr ")" Value <- Number / Parenthesised Product <- Value (("*" / "/") Value)* Sum <- Product (("+" / "-") Product)* Expr <- Sum '. "The Value from the wiki grammar was expanded into separate Number and Parethesised expression productions. It'll hopefully will be clear why shortly. You already know how to get a parser from grammar. That's always the same and for anything more complex you'll probably want to cache the parser instead of recreating it every time, but this will do fine for now:" parser := PEGParser parserPEG parse: 'Grammar' stream: simpleArithmeticExpressionGrammarDescription actor: PEGParserParser new. "If you parse the following expression and print the result you'll get" parser parse: 'Expr' stream: '3+4*(10-4)' reading actor: nil {{an OrderedCollection($3) . an OrderedCollection()} . an OrderedCollection({'+' . {an OrderedCollection($4) . an OrderedCollection({'*' . {'(' . {{an OrderedCollection($1 $0) . an OrderedCollection()} . an OrderedCollection({'-' . {an OrderedCollection($4) . an OrderedCollection()}})} . ')'}})}})} "which is roughly formatted manually to provide at least some resemblance of the expression tree" #( #(OrderedCollection ($3 "16r0033") OrderedCollection ()) OrderedCollection (# ('+' #(OrderedCollection ($4 "16r0034") OrderedCollection (# ('*' #( '(' #( #(OrderedCollection ($1 "16r0031" $0 "16r0030") OrderedCollection ()) OrderedCollection (# ('-' #(OrderedCollection ($4 "16r0034") OrderedCollection ())))) ')'))))))) "This not easy to take apart to work with. To do this you need an Actor that will receive rule specific callbacks and do whatever is needed. In this case we would like to evaluate the expression and return the result. We create a subclass of PEGActor called PEGSimpleArithmeticExpressionEvaluator. As we mentioned earlier, the Actor class provides a pragma based mechanism for expressing rule specific callbacks. So for programming the Evaluator class you simply need to write a method for each rule what should be done and annotate it with a pragma. Let's first fill in the 'Number' rule to deal convert the digit collections into numbers:" PEGSimpleArithmeticExpressionEvaluator>>Number: digits <action: 'Number'> ^digits inject: 0 into: [ :total :digit | total * 10 + ('0123456789' indexOf: digit) - 1 ] "With this you can parse numbers" parser parse: 'Expr' stream: '3456' reading actor: PEGSimpleArithmeticExpressionEvaluator new "Print it gives" {{3456 . an OrderedCollection()} . an OrderedCollection()} parser parse: 'Expr' stream: '3+4+5+6+7' reading actor: PEGSimpleArithmeticExpressionEvaluator new "So if you add a method for Sum, that's what it will need to deal with:" Sum: terms <action: 'Sum'> ^terms last inject: terms first into: [ :total :next | (next first = '+') ifTrue: [ total + next last ] ifFalse: [ total - next last ] ] "The Actor pragmas provide another convenience for taking more complex input apart. You can write the above as Sum: first rest: pairs <action: 'Sum' arguments: #(1 2)> ^pairs inject: first into: [ :total :pair | (pair first = '+') ifTrue: [ total + pair last ] ifFalse: [ total - pair last ] ] "The arguments: keyword simply says invoke this method with elements 1 and 2 of the rule input (terms) as arguments. Obviously the method must take the corresponding number of arguments as well. So with the above rules the addition expression should give you 25. " "The Product rule handling is analogous:" Product: first rest: pairs <action: 'Product' arguments: #(1 2)> ^pairs inject: first into: [ :total :pair | (pair first = '*') ifTrue: [ total * pair last ] ifFalse: [ total / pair last ] ] "Finally the parenthesised expression is rather simple:" Parenthesised: expression <action: 'Parenthesised' arguments: #(2)> ^expression "It receives a three element array: the left bracket, the expression inside and the right bracket. All we really need to do is return the expression in the middle. The above is using the arguments: pragma again to achieve that, the following would be equivalent:" Parenthesised: expression <action: 'Parenthesised'> ^expression at: 2 "By now is hopefully clearer why it was useful to factor the Value definition out into separate Number and Parenthesised rule. Otherwise we'd have to handle both cases in a single Value: method, which would make the code messier. A final parsing example" parser parse: 'Expr' stream: '((3+4)*2-4)/2' reading actor: PEGSimpleArithmeticExpressionEvaluator new 5 |
In reply to this post by Hannes Hirzel
"Hannes Hirzel"<[hidden email]> wrote:
> I worked through it and understood it. I adapted the class names so > that it runs in the current Squeak 4.2 out of the box. In addition I > edited it slightly. I think it could be added as to the PEGParser > class. Maybe somebody else like to check it in addition .... Yes, wrote it with the intent to eventually turn it into documentation for the Parsing package. On VW side we keep these in package comments, that's the most obvious location for it IMO. I'm not convinced that using a class comment on Squeak/Pharo side is a good choice. I would have difficulty myself to pick the class for it, so the chances that a new user will find it are even lower. I'm intrigued by the help facilities that were mentioned on Pharo side quite a bit recently. Is that something people hope to turn into the primary resource for these things ? Sounds like it would be quite easy to package it as that. The package comment are already written using the google wiki syntax, so that they can be directly posted to the google project site. So the help facilities allowing wiki syntax sound particularly attractive. Is this something that could work on Squeak side too, or do we need to do something different there ? In the meantime I intend to at least keep the google project pages up to date. > Additional example: Wiki text parsing > > In the previous mail you mentioned that you'll add the fileout of the class > > PEGWikiGenerator Ah, I got caught up in the arithmetic example and forgot the attachments. They are attached now. Sorry about that. Thanks for the edited version of the example. I'll work on it some more and turn it into a doc page on the site. If we figure out how to package it in Squeak and Pharo so that people can find it easily, I'd be happy to add all the documentation we have on the site. Cheers, Martin |
Free forum by Nabble | Edit this page |