Re: [squeak-dev] Xtreams up to date

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Xtreams up to date

Hannes Hirzel
Thank you Martin,

your explanations got me some steps further.

I found the page http://pdos.csail.mit.edu/~baford/packrat/ by Bryan
Ford which seems to be a good entry point for reading about PEG
parsing. That page refers to
http://en.wikipedia.org/wiki/Parsing_expression_grammar as a good
technical overview.

I took the wiki example from
http://code.google.com/p/xtreams/wiki/Parsing and adapted the names so
that it fits the Squeak implementation


A Parser is created from a PEG grammar by parsing the grammar rules
using the ParserParser.

        wikiGrammar := PEGParser grammarWiki reading.
        wikiParser := PEGParser parserPEG parse: 'Grammar' stream:
wikiGrammar actor: PEGParserParser new.


        input := 'Single paragraph with *bold* and _italic_ text and a
[link]' reading.

"As you indicated I used the PEGTreeBuilder"
          wikiParser parse: 'Page' stream: input actor: PEGTreeBuilder new

If I explore the resulting object I got what is shown in the screen
shot. I am not sure how to interpret the result.



"Regarding the wiki example: the class PEGWikiGenerator is not included"

        wikiParser parse: 'Page' stream: input actor: PEGWikiGenerator new.

Where can I get this class?


The comment of the method  #parse:stream:actor: of class PEGParser is helpful

"Parse @definition from aStream using @anActor to act on the matching rules.
        "" definition <String> identifies the rule of the grammar to apply
to the input
                aStream <ReadStream> the input to parse
                anActor <Actor> receives callbacks from the parser for the
successfully rules with the matching the input
                ^ <Object> result of the actor action for the @definition
        "


As class methods of PEGParser there are 8 grammars included - CSS3,
HttpUrl, JSON, JavaScript, PEG, Smalltalk, Wiki, XML.

I think a simple complete example which actually does something would
be helpful for me.
TreeBuilder  seems to do nothing and I do not see the tree.

For example besides the classes PEGParserPEGTest and
PEGParserSmalltalkTest (6 test cases) a class PEGParserHttpUrlTest or
PEGParserJavaScriptTest would be helpful.


Actually I would like to parse some simple Javascript like

                                mySent = new aswExample('')
                                mySent.addWord('Ninatoka')
                                mySent.currentWord().firstLetterUppercase=true
                                mySent.currentWord().addMorpheme('ni-','S1s-')
                                mySent.currentWord().addMorpheme('na-','Pres.-')

mySent.currentWord().addMorpheme('tok','come_from')
                                mySent.currentWord().addMorpheme('-a','-VF')
                                mySent.addWord('Uswisi.')

mySent.currentWord().addMorpheme('Uswisi','Switzerland.')
                                mySent.writeHTML()

It is used for web pages on which Swahili text is displayed. On
request a morphological analysis is shown.

A note: this example could be parsed with some manually constructed
code without PEG I assume, just string matching and some branching. So
there is no pressure on this. But I'd like to use it to learn PEG
parsing. The goal is to translate the JavaScript code to Smalltalk
code (and later one to an XML representation for printing).


--Hannes



On 1/21/11, [hidden email] <[hidden email]> wrote:

> "Hannes Hirzel"<[hidden email]> wrote:
>> Besides http://code.google.com/p/xtreams/wiki/Parsing are there more
>> examples how to use Peg parsing?
>
> Unfortunately that one is still waiting for documentation. Even the little
> bit that is on the wiki page uses an example that wasn't ported because it
> uses VW's XML Elements.
>
> It's actually not that complicated to use. Basically you take a PEG grammar,
> you parse it as the wiki example shows (the first 2 lines). Then you need an
> 'actor' that will get callbacks from the parser when it parses some input.
> The Actor class provides simple pragma based dispatch mechanism for
> convenience, but ultimately the actor simply receives
> #process:object:start:stop: callbacks from successfully applied grammar
> rules. The TreeBuilder in the Tests package shows the simplest Actor that
> just collects all the callback arguments. The return value of the callback
> is important, the object: argument of the callback is a collection of what
> the lower level rules returned from their callbacks. Anyway that's real
> quick description of how the parsing mechanism works from user point of
> view. Real documentation needs a lot of simple examples to make that
> clearer, but hopefully this could get you going if you experiment with
> simpler grammars and the TreeBuilder a bit. I'll try to answer any more
> specific questions as I can.
>
> HTH,
>
> Martin
>

ExploreResultOfPEGwikiParser.PNG (55K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Xtreams up to date

mkobetic
"Hannes Hirzel"<[hidden email]> wrote:
> If I explore the resulting object I got what is shown in the screen
> shot. I am not sure how to interpret the result.

Yes, it's not the best possible representation, but it does show the full parse tree. The tree can be rather complicated depending on how complicated is the grammar. That's why I suggested playing with simpler grammars. But the reason why I even mentioned the TreeBuilder is that it's the simplest possible Actor, it implements single method that packages the result of each successful production as it's own node. Maybe using an array instead of associations would create a slightly more "tree like" result.

> "Regarding the wiki example: the class PEGWikiGenerator is not included"
>
>         wikiParser parse: 'Page' stream: input actor: PEGWikiGenerator new.
>
> Where can I get this class?

I've attached a fileout of the class that you can file into Squeak, except it references VW XML classes, so it won't work. If you fix it up to use the Squeak XML stuff you should be able to get it to work. All you really need is XML.Element and XML.Tag equivalent. The second fileout shows how the class variables are initialized in VW. You might need to turn the initializers into equivalent class side initialize methods.

> As class methods of PEGParser there are 8 grammars included - CSS3,
> HttpUrl, JSON, JavaScript, PEG, Smalltalk, Wiki, XML.
>
> I think a simple complete example which actually does something would
> be helpful for me. TreeBuilder  seems to do nothing and I do not see the tree.

OK, let's try something really simple. The wikipedia page shows a simple arithmetic expression grammar. Here it is transcribed for Xtreams:

        grammar := '
                Number <- [0-9]+
                Parenthesised <- "(" Expr ")"
                Value <- Number / Parenthesised
                Product <- Value (("*" / "/") Value)*
                Sum <- Product (("+" / "-") Product)*
                Expr <- Sum'.

I expanded the Value from the wiki grammar into separate Number and Parethesised expression productions. It'll hopefully be clear why shortly. You already know how to get a parser from grammar. That's always the same and for anything more complex you'll probably want to cache the parser instead of recreating it every time, but this will do fine for now:

        parser := PEG.Parser parserPEG parse: 'Grammar' stream: grammar actor: PEG.ParserParser new.

With this you can already parse expressions (note that the grammar as it is written cannot handle whitespace)

        parser parse: 'Expr' stream: '3+4*(10-4)' reading actor: nil

you'll get this (roughly formatted to provide at least some resemblance of the expression tree):

        #( #(OrderedCollection ($3 "16r0033") OrderedCollection ())
                OrderedCollection (#
                        ('+' #(OrderedCollection ($4 "16r0034")
                        OrderedCollection (#
                                ('*' #(
                                        '(' #( #(OrderedCollection ($1 "16r0031" $0 "16r0030") OrderedCollection ())
                                                        OrderedCollection (#
                                                        ('-' #(OrderedCollection ($4 "16r0034") OrderedCollection ())))) ')')))))))

That's obviously not very easy to take apart to work with, that's why you want an Actor that will receive rule specific callbacks and do whatever. Let's say we want to evaluate the expression and return the result. We'll make a subclass of Actor called Evaluator. As I mentioned earlier, the Actor class provides pragma based mechanism for expressing rule specific callbacks so you simply write a method for each rule for which you want to do something and annotate it with a pragma. Let's first try to turn the digit collections into numbers:

Evaluator>>Number: digits
        <action: 'Number'>
        ^digits inject: 0 into: [ :total :digit | total * 10 + ('0123456789' indexOf: digit) - 1 ]


With this you can parse numbers

        parser parse: 'Expr' stream: '3456' reading actor: Evaluator new

Now let's try to add. We'll want a method for the Sum production but there's a slightly tricky part there. Note that the rule is defined as

        Sum <- Product (("+" / "-") Product)*

which means that you get a term (Product) and then zero or more pairs of + or - and another term. What the method receives will be collection of the first term and then pairs of the operator with its term. So parsing this:

        parser parse: 'Expr' stream: '3+4+5+6+7' reading actor: Evaluator new

will give you

        #(3 OrderedCollection (#('+' 4) #('+' 5) #('+' 6) #('+' 7)))

So if you add a method for Sum, that's what it will need to deal with:

Sum: terms
        <action: 'Sum'>
        ^terms last inject: terms first into: [ :total :next |
                (next first = '+') ifTrue: [ total + next last ] ifFalse: [ total - next last ] ]

The Actor pragmas provide another convenience for taking more complex input apart. You can write the above as

Sum: first rest: pairs
        <action: 'Sum' arguments: #(1 2)>
        ^pairs inject: first into: [ :total :pair |
                (pair first = '+') ifTrue: [ total + pair last ] ifFalse: [ total - pair last ] ]

The arguments: keyword simply says invoke this method with elements 1 and 2 of the rule input (terms) as arguments. Obviously the method must take the corresponding number of arguments as well. So with the above rules the addition expression should give you 25. The Product rule handling is analogous:

Product: first rest: pairs
        <action: 'Product' arguments: #(1 2)>
        ^pairs inject: first into: [ :total :pair |
                (pair first = '*') ifTrue: [ total * pair last ] ifFalse: [ total / pair last ] ]

Finally the parenthesised expression is rather simple:

Parenthesised: expression
        <action: 'Parenthesised' arguments: #(2)>
        ^expression

It receives a three element array: the left bracket, the expression inside and the right bracket. All we really need to do is return the expression in the middle. The above is using the arguments: pragma again to achieve that, the following would be equivalent:

Parenthesised: expression
        <action: 'Parenthesised'>
        ^expression at: 2

I hope by now it's also clearer why it was useful to factor the Value definition out into separate Number and Parenthesised rule. Otherwise we'd have to handle both cases in a single Value: method, which would make the code messier.

> For example besides the classes PEGParserPEGTest and
> PEGParserSmalltalkTest (6 test cases) a class PEGParserHttpUrlTest or
> PEGParserJavaScriptTest would be helpful.

Yup, the tests are seriously wanting as well. Hopefully we'll get there soon. BTW, the parsing stuff doesn't work for me out of the box in either Squeak or Pharo in the images that I use (which should be quite clean). The problem I'm seeing is that the Dictionary>>addAll: in ParserParser>>Grammar: doesn't work as expected. It's trying to add collection of associations to a dictionary but ends up adding them as values indexed by integers. I can fix it but I wonder what do people have loaded that makes it work for them.

Anyway, hopefully the above can get you going.

Cheers,

Martin

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Xtreams up to date

Hannes Hirzel
On 1/23/11, [hidden email] <[hidden email]> wrote:
> OK, let's try something really simple. The wikipedia page shows a simple
> arithmetic expression grammar.

Hello Martin

Thank you for working out the Wikipedia simple arithmetic expression
grammar into a tutorial style text.

I worked through it and understood it. I adapted the class names so
that it runs in the current Squeak 4.2 out of the box. In addition I
edited it slightly. I think it could be added as to the PEGParser
class. Maybe somebody else like to check it in addition ....


Additional example: Wiki text parsing

In the previous mail you mentioned that you'll add the fileout of the class

   PEGWikiGenerator

I did not get the fileout.

--Hannes


=============================================
PEGParser example

Parsing simple arithmetic expressions

The following may be copied into a workspace
=============================================


"The wikipedia page shows a simple arithmetic expression grammar. Here
it is transcribed for Xtreams:"

simpleArithmeticExpressionGrammarDescription := '

               Number                <- [0-9]+
               Parenthesised      <- "(" Expr ")"
               Value                   <- Number / Parenthesised
               Product                <- Value (("*" / "/") Value)*
               Sum                     <- Product (("+" / "-") Product)*
               Expr                     <- Sum

                '.

"The Value from the wiki grammar was expanded into separate Number and
Parethesised expression productions. It'll hopefully will be clear why
shortly.

You already know how to get a parser from grammar. That's always the
same and for anything more complex you'll probably want to cache the
parser instead of recreating it every time, but this will do fine for
now:"


parser := PEGParser parserPEG
                                parse: 'Grammar'
                                stream: simpleArithmeticExpressionGrammarDescription
                                actor: PEGParserParser new.
                               
                               
"If you parse the following expression and print the result you'll get"


 parser parse: 'Expr' stream: '3+4*(10-4)' reading actor: nil


{{an OrderedCollection($3) . an OrderedCollection()} . an
OrderedCollection({'+' . {an OrderedCollection($4) . an
OrderedCollection({'*' . {'(' . {{an OrderedCollection($1 $0) . an
OrderedCollection()} . an OrderedCollection({'-' . {an
OrderedCollection($4) . an OrderedCollection()}})} . ')'}})}})}

"which is roughly formatted manually to provide at least some
resemblance of the expression tree"


       #(      #(OrderedCollection ($3 "16r0033") OrderedCollection ())
               OrderedCollection (#
                       ('+' #(OrderedCollection ($4 "16r0034")
                       OrderedCollection (#
                               ('*' #(
                                       '(' #(  #(OrderedCollection ($1
"16r0031" $0 "16r0030") OrderedCollection ())
                                                       OrderedCollection (#
                                                       ('-'
#(OrderedCollection ($4 "16r0034") OrderedCollection ())))) ')')))))))


"This not easy to take apart to work with. To do this you need an
Actor that will receive rule specific callbacks and do whatever is
needed.

In this case we would like to evaluate the expression and return the result.

We create a subclass of PEGActor called
PEGSimpleArithmeticExpressionEvaluator. As we mentioned earlier, the
Actor class provides a pragma based mechanism for expressing rule
specific callbacks. So for programming the Evaluator class you simply
need to write a method for each rule what should be done and annotate
it with a pragma.

Let's first fill in the 'Number' rule to deal convert the digit
collections into numbers:"

PEGSimpleArithmeticExpressionEvaluator>>Number: digits
       <action: 'Number'>
       ^digits inject: 0 into: [ :total :digit | total * 10 +
('0123456789' indexOf: digit) - 1 ]



"With this you can parse numbers"

       parser parse: 'Expr' stream: '3456' reading actor:
PEGSimpleArithmeticExpressionEvaluator new

"Print it gives"
{{3456 . an OrderedCollection()} . an OrderedCollection()}


        parser parse: 'Expr' stream: '3+4+5+6+7' reading actor:
PEGSimpleArithmeticExpressionEvaluator new


"So if you add a method for Sum, that's what it will need to deal with:"

Sum: terms
       <action: 'Sum'>
       ^terms last inject: terms first into: [ :total :next |
               (next first = '+') ifTrue: [ total + next last ]
ifFalse: [ total - next last ] ]



"The Actor pragmas provide another convenience for taking more complex
input apart. You can write the above as

Sum: first rest: pairs
       <action: 'Sum' arguments: #(1 2)>
       ^pairs inject: first into: [ :total :pair |
               (pair first = '+') ifTrue: [ total + pair last ]
ifFalse: [ total - pair last ] ]

"The arguments: keyword simply says invoke this method with elements 1
and 2 of the rule input (terms) as arguments. Obviously the method
must take the corresponding number of arguments as well. So with the
above rules the addition expression should give you 25. "



"The Product rule handling is analogous:"

Product: first rest: pairs
       <action: 'Product' arguments: #(1 2)>
       ^pairs inject: first into: [ :total :pair |
               (pair first = '*') ifTrue: [ total * pair last ]
ifFalse: [ total / pair last ] ]



"Finally the parenthesised expression is rather simple:"

Parenthesised: expression
       <action: 'Parenthesised' arguments: #(2)>
       ^expression

"It receives a three element array: the left bracket, the expression
inside and the right bracket. All we really need to do is return the
expression in the middle. The above is using the arguments: pragma
again to achieve that, the following would be equivalent:"

Parenthesised: expression
       <action: 'Parenthesised'>
       ^expression at: 2


"By now is hopefully clearer why it was useful to factor the Value
definition out into separate Number and Parenthesised rule. Otherwise
we'd have to handle both cases in a single Value: method, which would
make the code messier.


A final parsing example"


parser parse: 'Expr' stream: '((3+4)*2-4)/2' reading actor:
PEGSimpleArithmeticExpressionEvaluator new

5

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Xtreams up to date

mkobetic
In reply to this post by Hannes Hirzel
"Hannes Hirzel"<[hidden email]> wrote:
> I worked through it and understood it. I adapted the class names so
> that it runs in the current Squeak 4.2 out of the box. In addition I
> edited it slightly. I think it could be added as to the PEGParser
> class. Maybe somebody else like to check it in addition ....

Yes, wrote it with the intent to eventually turn it into documentation for the Parsing package. On VW side we keep these in package comments, that's the most obvious location for it IMO. I'm not convinced that using a class comment on Squeak/Pharo side is a good choice. I would have difficulty myself to pick the class for it, so the chances that a new user will find it are even lower. I'm intrigued by the help facilities that were mentioned on Pharo side quite a bit recently. Is that something people hope to turn into the primary resource for these things ? Sounds like it would be quite easy to package it as that. The package comment are already written using the google wiki syntax, so that they can be directly posted to the google project site. So the help facilities allowing wiki syntax sound particularly attractive. Is this something that could work on Squeak side too, or do we need to do something different there ? In the meantime I intend to at least keep the google project pages up to date.

> Additional example: Wiki text parsing
>
> In the previous mail you mentioned that you'll add the fileout of the class
>
>    PEGWikiGenerator

Ah, I got caught up in the arithmetic example and forgot the attachments. They are attached now. Sorry about that.

Thanks for the edited version of the example. I'll work on it some more and turn it into a doc page on the site. If we figure out how to package it in Squeak and Pharo so that people can find it easily, I'd be happy to add all the documentation we have on the site.

Cheers,

Martin

WikiGenerator.st (7K) Download Attachment
PEGWikiGenerator-xhtml-tags.st (4K) Download Attachment