Smalltalk › Frameworks & Tools › Moose

Indentation Sensitive Languages in PetitParser

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

4 messages Options

kurs.jan

Indentation Sensitive Languages in PetitParser

Hi All,

I have pushed a new version for indentation sensitivity into the PetitParser. If I am not mistaken, with the new changes, you can parse languages such as Python, F#, Haskell, YAML or Markdown. There are proof of concept implementations in the PetitParser repository. YAML and Markdwon are almost complete, for Python only the indentation-sensitive rules.

There is a small introduction to the indentation parsing with PetitParser (http://scg.unibe.ch/research/indentParsing), let me know, if you are missing something or if you want to know something more. There are also many examples of the indentation sensitive grammars in the repository.

If you wonder about performance, the indentation sensitive grammar is approximately five times slower, because the position and an extra indentation stack is remembered instead of the simple position. If you don't use indentation, there is no slowdown. I am working on improvements, any ideas are welcome :)

If you don't like new changes, you can still stable version (ConfigurationOfPetitParser loadStable) or version 1.11.

Regards,

Jan

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Thierry Goubier

Re: Indentation Sensitive Languages in PetitParser

Hi Jan,

2015-04-02 16:36 GMT+02:00 Jan Kurš <[hidden email]>:

Hi All,

I have pushed a new version for indentation sensitivity into the PetitParser. If I am not mistaken, with the new changes, you can parse languages such as Python, F#, Haskell, YAML or Markdown. There are proof of concept implementations in the PetitParser repository. YAML and Markdwon are almost complete, for Python only the indentation-sensitive rules.

There is a small introduction to the indentation parsing with PetitParser (http://scg.unibe.ch/research/indentParsing), let me know, if you are missing something or if you want to know something more. There are also many examples of the indentation sensitive grammars in the repository.

This is interesting. Thanks for the description and the link.

One note in the end: isn't it that in fact what you are eating with the indentation stack is more than whitespace (i.e. bullet points)? See last paragraph, because your approach benefit is that you can use more than whitespace as indent, anything arbitrary would do, if I have understood correctly.

If you wonder about performance, the indentation sensitive grammar is approximately five times slower, because the position and an extra indentation stack is remembered instead of the simple position. If you don't use indentation, there is no slowdown. I am working on improvements, any ideas are welcome :)

Totally out of the blue: your typical indentation parsing expression is a regex type of thing. What about compiling it as a SmaCC scanner? Such a PP parser would be equivalent to a Lexing automaton, and good compilers exist for such automatons.

Thierry

If you don't like new changes, you can still stable version (ConfigurationOfPetitParser loadStable) or version 1.11.

Regards,
Jan

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Jan Kurš

Re: Indentation Sensitive Languages in PetitParser

Hey Thierry,

Thank you for your input!

On 2 April 2015 at 16:59, Thierry Goubier <[hidden email]> wrote:

Hi Jan,

2015-04-02 16:36 GMT+02:00 Jan Kurš <[hidden email]>:
Hi All,

I have pushed a new version for indentation sensitivity into the PetitParser. If I am not mistaken, with the new changes, you can parse languages such as Python, F#, Haskell, YAML or Markdown. There are proof of concept implementations in the PetitParser repository. YAML and Markdwon are almost complete, for Python only the indentation-sensitive rules.

There is a small introduction to the indentation parsing with PetitParser (http://scg.unibe.ch/research/indentParsing), let me know, if you are missing something or if you want to know something more. There are also many examples of the indentation sensitive grammars in the repository.

This is interesting. Thanks for the description and the link.

One note in the end: isn't it that in fact what you are eating with the indentation stack is more than whitespace (i.e. bullet points)? See last paragraph, because your approach benefit is that you can use more than whitespace as indent, anything arbitrary would do, if I have understood correctly.

Oh, you are absolutely right, you can consume any parsing expression with the indentation stack (even though I would prefer nothing more complicated than regex there :).

I did not mentioned it in the link above. But for example in Markdown, one use the indentation stack to recognize nesting of quoted blocks and lists, that starts with '>' and with whitespaces. In markdown, each quoted block or nested list pushes to the indentation stack and all the expressions on the indentation stack are evaluated in a sequence at the beginning of each line. So IMO you understand that correctly.

If you wonder about performance, the indentation sensitive grammar is approximately five times slower, because the position and an extra indentation stack is remembered instead of the simple position. If you don't use indentation, there is no slowdown. I am working on improvements, any ideas are welcome :)

Totally out of the blue: your typical indentation parsing expression is a regex type of thing. What about compiling it as a SmaCC scanner? Such a PP parser would be equivalent to a Lexing automaton, and good compilers exist for such automatons.

Good point, I do plan to do something like this for any parsing expression to get some performance. And it is a good idea to use SmaCC infrastructure for this.

Yet, the main performance problem with indentation expressions is to manage the indentation stack. That stack needs to be memoized (before choices, sequences, predicates) and restored if necessary, which takes most of the time. I hope some copy-on-demand will lower the overhead.

Thierry

Jan

If you don't like new changes, you can still stable version (ConfigurationOfPetitParser loadStable) or version 1.11.

Regards,
Jan

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Thierry Goubier

Re: Indentation Sensitive Languages in PetitParser

Le 02/04/2015 18:30, Jan Kurš a écrit :
> Hey Thierry,
>
> Thank you for your input!

You're doing interesting stuff :)

> On 2 April 2015 at 16:59, Thierry Goubier <[hidden email]
> <mailto:[hidden email]>> wrote:
>
> Hi Jan,
>
> 2015-04-02 16:36 GMT+02:00 Jan Kurš <[hidden email]
> <mailto:[hidden email]>>:
>
> Hi All,
>
> I have pushed a new version for indentation sensitivity into the
> PetitParser. If I am not mistaken, with the new changes, you can
> parse languages such as Python, F#, Haskell, YAML or Markdown.
> There are proof of concept implementations in the PetitParser
> repository. YAML and Markdwon are almost complete, for Python
> only the indentation-sensitive rules.
>
> There is a small introduction to the indentation parsing with
> PetitParser (http://scg.unibe.ch/research/indentParsing), let me
> know, if you are missing something or if you want to know
> something more. There are also many examples of the indentation
> sensitive grammars in the repository.
>
>
> This is interesting. Thanks for the description and the link.
>
> One note in the end: isn't it that in fact what you are eating with
> the indentation stack is more than whitespace (i.e. bullet points)?
> See last paragraph, because your approach benefit is that you can
> use more than whitespace as indent, anything arbitrary would do, if
> I have understood correctly.
>
>
> Oh, you are absolutely right, you can consume any parsing expression
> with the indentation stack (even though I would prefer nothing more
> complicated than regex there :).

I would say that if you go above regex there, then you are trying to
solve a more general parsing problem than the indentation :)

> I did not mentioned it in the link above. But for example in Markdown,
> one use the indentation stack to recognize nesting of quoted blocks and
> lists, that starts with '>' and with whitespaces. In markdown, each
> quoted block or nested list pushes to the indentation stack and all the
> expressions on the indentation stack are evaluated in a sequence at the
> beginning of each line. So IMO you understand that correctly.

Well, I think that in your documentation, you are giving an example :)

Note that if you handle indent at the beginning of a line in a
lex/flex/SmaCC scanner with states, you do have the same ability to eat
anything since what you are doing is providing a regex for the start of
the line (i.e., you're also not limited to whitespace in a traditional
parser).

So I'll revise my analysis: your benefit is the ability to recursively
stack indentation states. The state lexing automaton in SmaCC/Lex/Flex
can switch states, but not stack them.

> If you wonder about performance, the indentation sensitive
> grammar is approximately five times slower, because the position
> and an extra indentation stack is remembered instead of the
> simple position. If you don't use indentation, there is no
> slowdown. I am working on improvements, any ideas are welcome :)
>
>
> Totally out of the blue: your typical indentation parsing expression
> is a regex type of thing. What about compiling it as a SmaCC
> scanner? Such a PP parser would be equivalent to a Lexing automaton,
> and good compilers exist for such automatons.
>
>
> Good point, I do plan to do something like this for any parsing
> expression to get some performance. And it is a good idea to use SmaCC
> infrastructure for this.

The SmaCC scanner compilation infrastructure is interesting. I spent
some time optimizing it to solve one problem: the generation of methods
which were too long for the vm bytecode; and to try to get each method
to be jitted (the 'trying to fit a method in the JIT' thread).

It has some inefficiencies, in part due to the fact that the SmaCC
scanner can generate multiple tokens for the same lexem. Used for
ambiguous parses in GLR mode.

> Yet, the main performance problem with indentation expressions is to
> manage the indentation stack. That stack needs to be memoized (before
> choices, sequences, predicates) and restored if necessary, which takes
> most of the time. I hope some copy-on-demand will lower the overhead.

Hum, there must be some hidden cost there because the cost of handling
the indentation stack in the SmaCC python parser is negligeable.

Thierry
_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev