Ok, I have read the SmaCC tutorial at http://www.refactoryworkers.com/SmaCC/ASTs.html a few times
and I am also following the help tool documentation inside Pharo for SmaCC, and I have to say I am very confused. Please bear with me, because I am extremely noob when it comes to parsing; this is my first effort.

The way I understand it so far, SmaCC uses a syntax similar to regex expressions to define parsers and scanners. Scanners evaluate a string to check that it contains a valid form; parsers divide it into parts named "tokens" and help in creating ASTs, which are basically hierarchy trees containing the syntax elements of a language.

Now, in order to make SmaCC work, I need to use the SmaCC tool that comes with Pharo. The SmaCC tool takes two inputs, a scanner and a parser class. Does that mean I need to create that parser and scanner class? I thought, since I define the syntax, that those things would be generated by the tool. What do I need to define exactly? And why, when I select PythonScanner2 and PythonParser2 and then click Compile LR, does it give an MNU "receiver of method is nil"? I am using the latest Pharo 4 image.

My goal is to parse Python types to similar Pharo objects. I get those Python types as strings, and my main focus is lists, dictionaries and tuples. The tricky part is that one can contain the other inside, in every imaginable way. The way I understand it, I will need something called "transformations" to convert those Python types to OrderedCollections, Arrays, etc., and anything that would make more sense for a Pharo coder.

Additionally, what is the meaning of the vertical bar? --> | e.g. | Number

Are there any other tutorials that can help a beginner like me understand these concepts?
I am not looking for someone to hand me the solution on a plate; I would love to learn and understand parsing, because I am very interested in making Pharo easy to mix with Python code and allowing Pharo to use Python libraries without the user having to learn or code Python :) As you may imagine, this is a crucial ingredient for my project Ephestos, which tries to use Pharo to script Blender by either replacing or cooperating with Blender Python. So learning a good way to parse Pharo code to Python code and vice versa is extremely important for me.
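As an illustration of the nesting described above (this uses Python's standard ast module, not SmaCC, just to show concretely what a string-to-object converter has to handle):

```python
import ast

# A Python literal where lists, dicts, and tuples nest arbitrarily,
# as described above: each container can hold any of the others.
source = "[67, {'a': (1, 2), 'b': [3, {'c': 4}]}, (5, [6])]"

# ast.literal_eval safely turns a literal string into the corresponding
# Python objects -- the same kind of string-to-object conversion the
# thread wants to reproduce on the Pharo side.
value = ast.literal_eval(source)
```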
Hi Kilon,
I think it’s better that you take a look at PetitParser (e.g. http://www.themoosebook.org/book/internals/petit-parser). I have found PetitParser more user friendly than SmaCC.
---> Save our in-boxes! http://emailcharter.org <---
Johan Fabry - http://pleiad.cl/~jfabry PLEIAD lab - Computer Science Department (DCC) - University of Chile
In reply to this post by kilon.alios
2015-01-28 14:53 GMT+01:00 kilon alios <[hidden email]>:
Ok. I made sure the help was up to date with the current SmaCC; the online tutorial may differ a bit (GUI, some of the class creation commands).
Scanners divide the input stream into tokens using regular expressions. Parsers build (sort of: at least they follow the steps of building) a tree (a parse tree) out of the tokens; that parse tree is then reduced and simplified into an AST. The AST represents the structure of the source code (as per the language definition in the grammar).
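To make the scanner half of that split concrete, here is a minimal sketch in Python (not SmaCC; the token names are illustrative only) of a regex-based scanner dividing an input stream into tokens:

```python
import re

# Each token class is defined by a regular expression, just as in a
# SmaCC scanner definition. These names are made up for the example.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("lbrack", r"\["),
    ("rbrack", r"\]"),
    ("comma",  r","),
    ("skip",   r"\s+"),   # whitespace: matched but discarded
]

def scan(text):
    """Divide the input stream into (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if kind != "skip":
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError("unexpected character at position %d" % pos)
    return tokens
```

A parser then takes this flat token stream and builds the tree; the scanner itself never sees any structure, only character patterns.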
If you give new classes, it will create them. If you give existing classes, it will reuse them.
This is the issue I told you about in Pharo 4. The fix is waiting for review before integration (https://pharo.fogbugz.com/default.asp?14730).
With the PythonParser, you get a visitor generated, so you can subclass it and visit the ast produced by the parser and generate the relevant OrderedCollections.
It means "or". For example:

    Atom: | Number ;

would mean that Atom is either empty or a Number.
I don't have any at hand: I teach that at the moment, so I'm not delegating to a tutorial for my students ;)
The way it is set should support you well for doing what you want, so keep doing it :) Thierry
In reply to this post by jfabry
Yes, I am aware of PetitParser, but the one thing that made me very interested in SmaCC is that it already supports Python syntax parsing. I think it is for Python 2, but if it's 2.7 it won't be much of an issue for me, even though I use Python 3.4 syntax. From a first look, PetitParser looks easier to use, but Thierry told me SmaCC is also fairly easy, and personally I don't care so much; I only want something that gets the job done as fast as possible. Having an existing Python parser certainly makes things faster for me. But yes, I have played around with PetitParser and I really liked it. On Wed, Jan 28, 2015 at 4:22 PM, Johan Fabry <[hidden email]> wrote:
In reply to this post by Thierry Goubier
"Ok. I made sure the help was up to date with the current SmaCC; the online tutorial may differ a bit (GUI, some of the class creation commands)."
Yes, I am not complaining about your effort. I am just new to parsing and everything looks alien to me :D I am mostly following your documentation inside Pharo, but I don't mind reading anything available.

"Scanners divide the input stream in tokens with regular expressions."

Ok, so regular expressions are used. Good, I am familiar with them, because I have already used them to parse Pharo messages into Python method calls. At least I know something useful :)

"If you give new classes, it will create them. If you give existing classes, it will reuse them."

Ok, so creating classes for parser and scanner is only optional. Does that mean that those classes can do some extra work that is not defined with the SmaCC syntax?

"This is the issue I told you about in Pharo 4. The fix is waiting for review before integration (https://pharo.fogbugz.com/default.asp?14730)."

Ah yes, now I remember. I could review it, as I am a member of Pharo FogBugz, but I am clueless about how it works and what it affects, so maybe I am not a good reviewer in this case.

"With the PythonParser, you get a visitor generated, so you can subclass it and visit the ast produced by the parser and generate the relevant OrderedCollections."

Roger, so that means I will have to take a deep look into the PythonParser class and try to figure things out.

"It means or."

I assumed so, but I wanted to make sure.

"I don't have any at hand: I teach that at the moment, so I'm not delegating to a tutorial for my students ;)"

I was not aware that you are a teacher and that you teach SmaCC, cool. So I think I will put more effort into reading the tutorial and experimenting with Pharo 3 until Pharo 4 is fixed.

"The way it is set should support you well for doing what you want, so keep doing it :)"

Great; if you say I can do this, that is already great news for me. The effort is not a problem, no pain no gain. I will be back with more questions.
Ok, so I tried to parse a very simple list like [ 67,12342,5 ] using "Parse and explore". I can find these numbers (for example 67) by following this AST path:

    PyFileInputNode>>statements:
      -> 1: PySimpleStmNode>>stmts:
      -> 1: PyExprStmtNode>>tests:
      -> 1: PyPowerNode>>atom:
      -> PyAtomNode>>list:
      -> 1: PyPowerNode>>atom:
      -> PyPowerNode>>numberToken
      -> numberToken>>value -> 67

Quite a structure, but the one thing I don't get is "tests". Why "tests"? Does it test something, and if yes, what?
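For comparison only: Python's own ast module produces a similarly deep structure for the same input, so this depth is not a SmaCC quirk. The chain of Module/Expr/List/Constant nodes mirrors the PyFileInputNode/PyExprStmtNode/PyAtomNode path above:

```python
import ast

tree = ast.parse("[67, 12342, 5]")
# Module -> Expr -> List -> Constant nodes: a chain much like the
# PyFileInputNode -> ... -> PyAtomNode -> numberToken path explored above.
expr = tree.body[0]    # the expression statement
lst = expr.value       # the List node
values = [c.value for c in lst.elts]
```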
Hi kilon, The tests instance variable is linked to the Python grammar: top-level items in an expression are probably tests, and, through the grammar, tests can be just atoms. So the tests instance variable doesn't mean it is testing anything :) Thierry On 30 Jan 2015 09:23, "kilon alios" <[hidden email]> wrote:
Ok, thanks for the info. I am still, however, curious about these "tests". Are they just tests (which may or may not happen) that determine the AST, for example which node to use? Or are they tests related to the unit testing class PythonParserTests? Also, you said I need to use the visitor created by PythonParser; I assume you mean PyRootNodeVisitor, just as it is explained in the AST chapter of the documentation? In my case, for this simple Python list, I will need to subclass it and override the method visitListmaker; the aListmaker passed as argument to the method is, I assume, a PyListmakerNode? On Fri, Jan 30, 2015 at 10:50 AM, Thierry Goubier <[hidden email]> wrote:
2015-01-30 14:04 GMT+01:00 kilon alios <[hidden email]>:
'tests' is just there because, in the grammar, there is this at a certain point:

    testlist: test 'test' "," testlist
            | test 'test' comma_opt ;

I have named this use of test 'test', so SmaCC has deduced that testlist will be a list of test nodes (or maybe other stuff such as atoms, depending on the productions for test). So, in each rule where testlist is found, SmaCC will add a 'tests' instance variable. Basically, the grammar rules explain how each node can be decomposed into sub-nodes, and the additional annotations (the 'test' naming and the {{}} or {{Name}}) drive how the classes for the nodes you want to keep are generated. In that testlist case, no node will be generated, but everywhere testlist appears on the right of a rule, it will add a 'tests' instance variable.
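How such a recursive rule ends up collecting a flat list of tests can be sketched with a tiny hand-written parser in Python (illustrative only; SmaCC's LR machinery does this very differently, but the effect on the 'tests' variable is the same):

```python
def parse_testlist(tokens, pos=0):
    """testlist: test "," testlist | test  -- collects all the tests.

    tokens is a list of (kind, value) pairs; a 'test' here is just a
    number token, standing in for the full Python 'test' production.
    """
    tests = []
    while True:
        kind, value = tokens[pos]
        assert kind == "number", "expected a test"
        tests.append(int(value))   # one 'test' per recursion step
        pos += 1
        if pos < len(tokens) and tokens[pos][0] == "comma":
            pos += 1               # consume "," and continue the recursion
        else:
            return tests, pos      # the collected 'tests'
```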
Not at all :)
In my experience, what you need to do is have a look at the AST generated and see if you can recognize the elements. From what I see in your simple example, the key to your list is that PyAtomNode instance with something in list. Once you have that, you know that you need to visit PyAtomNode (and check that it has the [ ] tokens). Looking into what listmaker is in atom in the grammar (congratulations, by the way, you have seen it :) ), you'll see that it creates a listmaker node only in the first case, test followed by a list_for; otherwise it falls back to testlist... Thierry
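The visitor pattern being described looks like this with Python's own ast module (an analogy only; in SmaCC you would subclass the generated visitor and override the corresponding visit method for PyAtomNode instead):

```python
import ast

class ListCollector(ast.NodeVisitor):
    """Visit list nodes and collect their constant elements."""
    def __init__(self):
        self.lists = []

    def visit_List(self, node):
        # Analogous to visiting a PyAtomNode and checking for [ ] tokens.
        self.lists.append([c.value for c in node.elts])
        self.generic_visit(node)   # keep walking into the children

collector = ListCollector()
collector.visit(ast.parse("[67, 12342, 5]"))
```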
Thanks for the congratulations, because at times I fear I ask too obvious questions. I have to say I find this parsing business very complex, but very fascinating too :) Time to experiment with the visitor. On Fri, Jan 30, 2015 at 11:49 PM, Thierry Goubier <[hidden email]> wrote:
Ok, so after rereading the tutorial and testing again and again, I think I have finally managed to understand how SmaCC really works, and I was successful in converting simple Python lists to Pharo arrays and ordered collections. The tricky part now is to apply this knowledge to complex Python types like multi-dimensional lists, tuples and dictionaries.

I understand that the visitor allows me to visit specific object instances each time they are found in the AST. But because I want to walk the AST in order to build multi-dimensional ordered collections, I need something more, or maybe my understanding of the visitor pattern is flawed. The problem I am having is that each Python type I parse is not necessarily represented by a different kind of node. For example, whether it's a list or a tuple or a dictionary, the same class is used: PyAtomNode. In order to differentiate between those different Python types, PyAtomNode has instance variables for the right and left bracket, parenthesis, and curly brace. So my initial thinking is to check those instance variables to see if they are nil, and from that conclude which Python type I am parsing.

So I can perform simple ifs that check whether an instance variable is nil or not, but the question is whether my strategy is a good one or a bad one. I could define my own syntax to simplify the AST tree, including different nodes for different Python types, because from the looks of it, it seems a bit too verbose for my needs. On the other hand, I am not so sure, because in the future my needs may become more verbose too. So I am very close and ready to create my full Python types converter for Pharo, but I wanted some good advice before wasting time on something that is not efficient.
By the way, Thierry, I have to agree with you: SmaCC is a very capable parser. I also like the use of regex syntax; it makes it uglier compared to PetitParser, but I prefer the compact regex syntax to having to define and browse tons of classes and send tons of messages. Also, the Python support is very good, and I am impressed by how easily SmaCC can parse whole Python applications, since some of the tests are very complex. Well done, great work! On Sat, Jan 31, 2015 at 12:04 AM, kilon alios <[hidden email]> wrote:
Hi Kilon,
2015-02-11 8:24 GMT+01:00 kilon alios <[hidden email]>:
Well, I see three things in what you want to achieve.

The first one is that checking the node (it has parentheses, brackets, etc...) is a good way of determining its type during the visitor traversal. You may extend PyAtomNode directly with selectors such as isArray, isDictionary, etc... so as to cope with the fact that the parser puts different objects under the umbrella PyAtomNode.

The second one is that I did most of the AST generation as a crude, get-rid-of-warnings approach (and it took me long enough). Now that you have a better understanding of your requirements, it may be a good idea to revisit the grammar and ensure that the right type of node is generated. For example, here are the atom productions:

    atom: <lparen> <rparen> {{}}
        | <lparen> yield_expr 'list' <rparen> {{}}
        | <lparen> testlist_comp 'list' <rparen> {{}}
        | <lbrack> <rbrack> {{}}
        | <lbrack> listmaker 'list' <rbrack> {{}}
        | <lcurly> dictorsetmaker 'list' <rcurly> {{}}
        | <lcurly> <rcurly> {{}}
        | "`" testlist1 'list' "`" {{BackTick}}
        | <name> {{Symbol}}
        | <number> {{Number}}
        | strings ;

What you see is that, with the {{}}, I create PyAtomNode instances for all productions, even if it isn't appropriate. Maybe this should be changed like this for lists:

    | <lbrack> <rbrack> {{List}}
    | <lbrack> listmaker 'list' <rbrack> {{List}}

Like that, I get a PyListNode when I parse '[ ]'. I just have to tune the AST generation code so that it gives me the nodes I need, and, at the moment, you are the right person to do so, since you're molding it to your needs.

And the last one is about the visitor. For complex processing like the transformations you intend, I would see two strategies: a builder inside the visitor with a stack/context strategy, so that you can recurse in your visit of the AST and add elements to the right collection; or a simple recurse-and-merge of the results of the lower visits (when in a List node, collect the visit of all the children as an Array or as an OrderedCollection).
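The second visitor strategy ("recurse and merge the result of the lower visits") can be sketched in Python against its own ast module; the mapping targets here (list, tuple, dict) stand in for the OrderedCollection, Array and Dictionary the Pharo converter would build:

```python
import ast

def convert(node):
    """Recurse into an AST node and merge the results of the lower visits."""
    if isinstance(node, ast.Module):
        return convert(node.body[0].value)
    if isinstance(node, ast.List):
        return [convert(e) for e in node.elts]        # -> OrderedCollection
    if isinstance(node, ast.Tuple):
        return tuple(convert(e) for e in node.elts)   # -> Array
    if isinstance(node, ast.Dict):
        return {convert(k): convert(v)                # -> Dictionary
                for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Constant):
        return node.value                             # numbers, strings, ...
    raise ValueError("unsupported node: %r" % node)

result = convert(ast.parse("[67, {'a': (1, 2)}]"))
```

Because each case merges the converted children, arbitrary nesting (a list holding a dict holding a tuple, and so on) falls out of the recursion for free.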
I believe this is approach two above: you can mold the grammar to your needs. As John Brant told me: an AST is not a parse tree. So you can, and should, adapt the AST generation in the way that makes sense to you.
I believe you are on the right path, if my explanations made sense :)
Thanks to all who have worked on Python parsing before and with me: the large test base I inherited from the previous parser is a huge benefit. Happy that you like it: implementing the indent/dedent tokens was the most interesting part in there. If you start changing the grammar as suggested above, make a fork and pull requests on github :)
"What you see is that, with the {{}}, I create PyAtomNode instances for all productions, even if it isn't appropriate. Maybe this should be changed like that for lists : | <lbrack> <rbrack> {{List}} | <lbrack> listmaker 'list' <rbrack> {{List}}" Both approaches you described a) adding instance methods to PyAtomNode that provide checks for the type b) Creating separate nodes for diffirent types work for me. Solution (b) seemed more smalltalky to me. Maybe a best compromise would be to have for example PyListNode as a subclass of PyAtomNode ? If I can create something that others find useful too, I certainly would prefer it. My needs are not very specific, I think I want pretty much what anyone would want for importing data from python to pharo . One way or another I will satisfy my needs this is not my worry. "And the last one is about the visitor. For complex processing like the transformations you intend, I would see two strategies: a builder inside the visitor with a stack/context strategy, so that you can recurse in your visit of the ast and add elements to the right collection, or a simple recurse and merge the result of the lower visits (when in a List node, collect the visit of all the children as an array or as an OrderedCollection)." Yes that was my way of thinking too. A collection of methods that consume the AST , walk the tree and build a more simplified structure. The problem I was having was two side, from one side PyAtomNode is used for several diffirent things. From the other side not only lists, dictionaries, tupples can be multidimensional but also can act as containers for each other. So a list can contain a dictionary which can contain a list which can contain a tupple and as you imagine the rabbit whole can go very deep. Of course nothing of this is surprising for any language. Generally speaking this is not such a big problem right now for me because I prefer dealing with simple types with a bit of multidimensionality. 
Most of the types Blender uses are like that. But it may become a problem later on, for example if the user wants access to the node system of Blender. Nodes can easily contain other nodes, and that can create a nightmare scenario, but I will leave that for when the time comes.

"I believe you are on the right path, if my explanations made sense :)"

Your explanation not only made sense, you pretty much described what I was considering doing.

"If you start changing the grammar as suggested above, make a fork and pull requests on github :)"

Will do. My focus is on the latest Python 3, because it's what Blender uses, but on types it should not make any difference.
In reply to this post by kilon.alios
Hi!
Maybe you also want to have a look at http://www.squeaksource.com/openqwaq/ There is a part called PyBridge included. Maybe you can take some of the Smalltalk Python CModel classes from there. Sebastian On 10.02.2015 at 23:24, kilon alios wrote:
In reply to this post by kilon.alios
2015-02-11 11:16 GMT+01:00 kilon alios <[hidden email]>:
Yes, if you find that appropriate, or if they share some implementation bits (I'm not sure of the latter, but it may help to organise stuff). What you do is, in the grammar, you add a %hierarchy directive, like this:

    %hierarchy Atom (List Dictionary);

And, at AST generation, SmaCC will inherit as much as possible from the Atom definition in List and Dictionary (at least, I suppose it does: some of the SmaCC code generation tools are rather impressive, and, if you look carefully, it is prepared for more than just generation of Smalltalk code).
And that's fine to refactor the grammar to match your requirements :)
For me, the way to deal with that is to have a model of that data and visitors on it, to tackle the "future" complexity. If you go that way (recursive structure), two good (and not that easy) things to have are: equality and copy.
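What "equality and copy" buy you on a recursive structure can be shown with Python's built-in collections (the same concerns apply to whatever Pharo classes the conversion targets):

```python
import copy

a = [67, {"a": (1, 2), "b": [3]}]
b = copy.deepcopy(a)   # structural copy: recurses into every level

# Structural equality compares element by element, recursively.
same = (a == b)                                 # equal in value...
distinct = (a is not b and a[1] is not b[1])    # ...but fully independent

b[1]["b"].append(4)    # mutating the copy leaves the original intact
```

With only a shallow copy, the nested dict and list would be shared, and mutating the copy would silently corrupt the original; that is why both operations are "not that easy" on recursive structures.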
See, I don't have much to do then ;)
Good point. It would be nice to have a diff of the grammars of Python 2 and Python 3; at the moment, there is space in the naming for a Python 3 parser (named PythonParser3, of course), however we would have a collision on the AST nodes (two PyAtomNodes, two ...). Have to consider that, and maybe have Py2AtomNode, etc... to leave space for the Python 3 AST if it differs too much. Thierry
In reply to this post by Sebastian Heidbrink-2
Well, I tried in the past to use OpenQwaq, but what I learned from the experience is that sometimes understanding others' code can be even more time-consuming than remaking it yourself. I spent an afternoon trying to understand OpenQwaq before starting Ephestos from complete scratch. Suffice to say that in another afternoon I had the prototype of Ephestos working, with the ability to send simple Python strings to the CPython interpreter to be executed, by which time I had not even been able to understand the basic architecture of OpenQwaq. The module you recommended alone contains around 11,000 lines of code. That's a lot of code to just read, and I can't find any class named CModel.

In any case, I like that I started from scratch. I want to keep my project small, light, well documented and super easy to hack, even if that means fewer features :) Also, the code is written for Squeak, not Pharo. In any case, I would need someone like Thierry to guide me through the code and help me understand, so it's definitely not as simple as "hey, I am going to use that code".

Executing Python code with Ephestos is already possible and easy; my problems now are a) converting from Python to Pharo types, which is why I am using SmaCC, and b) the use of callbacks, which should be easy to do, so that Python code can call Pharo code. In the future, if sockets prove too slow, I may move my architecture to a shared DLL instead to make things faster. But as far as using Python code from Pharo goes, I am very close to achieving my goal by the end of this year. On Wed, Feb 11, 2015 at 4:40 PM, Sebastian Heidbrink <[hidden email]> wrote:
"Yes, if you find that appropriate or if they share some implementation bits (I'm not sure of the latter, but it may help to organise stuff). What you do is, in the grammar, you add a %hierarchy directive, like that: %hierarchy Atom (List Dictionary); And, at AST generation, SmaCC will inherit as much as possible from Atom definition in List and Dictionary (at least, I suppose it does: some of the SmaCC code generation tools are rather impressive, and, if you look carefully, it is prepared for more than just generation of Smalltalk code)" Nice and I was meaning to ask that how to do this with the grammar. Great answer. "For me, the way to deal with that is to have a model of that data and visitors on it, to tackle the "future" complexity. If you go that way (recursive structure), two good (and not that easy) things to have are: equality and copy." this I dont understand. I want python types to convert to popular pharo data classes . I don't want to create my own custom classes because the way I see it python and pharo are very close together as dynamic languages. So my thinking is that if equality works for an ordered collection and a array , then there is no need for me to add anything new since I will convert to those classes. Unless you mean something else that I am missing here. My goal is to allow people to use python code and libraries without having to worry for how the python side maps to the pharo side. At least not when it comes to types. So they will use those python libraries as if they are pharo libraries. A challenge will be converting types that reference python objects. The good news is that SmaCC already handles this situation but I will have to find a way to sync references between pharo and python. I think I know a way for that one. "Good point. 
It would be nice to have a diff on the grammars of Python2 and Python3; at the moment, there is space in the naming for a Python 3 parser (named PythonParser3, of course), however we would have a collision on the AST nodes (two PyAtomNodes, two ...). Have to consider that and maybe have Py2AtomNode, etc... to leave space for the Python 3 AST if it differs too much." From what I have seen , because I have to confess I am more of python 3 than python 2 coder, there are no big diffirences in python types at least not on the basic ones, but syntax wise there is as it is explained very well in this reference So yes there is a clear diffirence between python 3 and python 2 but its not massive. And yes those 2 grammars should be kept separately. For the time being I have no reason to encourage you to make a python 3 grammar port, because frankly I dont need it. I am happy with SmaCC as it is. There is no need for my project to fully parse python syntax and I dont see in the next years at least any need to go beyond python types. If the need arises and no py3 parser exists I will of course make it myself and send you a pull request. I am so grateful that already SmaCC has saved me from so much work , thanks to all people contributing to SmaCC , one very happy customer :D On Wed, Feb 11, 2015 at 7:36 PM, kilon alios <[hidden email]> wrote:
In reply to this post by kilon.alios
The problem with reading class-oriented code (C++, Java, C#, Ruby, ..., Smalltalk) is that the code does not reveal how the system will work at runtime (polymorphism). The essence of object orientation is that objects collaborate to achieve a goal. The DCI programming paradigm adds code for how the system works at runtime:
Cheers --Trygve On 11.02.2015 18:36, kilon alios wrote:
In reply to this post by Thierry Goubier
Yes, and I would love to see that documented in the chapter :)
On 11/2/15 09:54, Thierry Goubier wrote:
> implementing the indent/dedent tokens was the most interesting part in there.
file_input: {{}}
          | file_input <NEWLINE> {{}}
          | file_input stmt 'statement' {{}} ;