PetitParser speed and cost of PPFailure

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

PetitParser speed and cost of PPFailure

Holger Freyther
Hi,

once again I am not sure if this is the right list. The first parser I wrote using
PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the
code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench
to get an idea of the speed.


I have two performance “issues” and wonder if others hand similar issues with
PetitParser and if there is a general approach to this.



1.) Combining two PPCharsetPredicates does not combine the “classification”
table it had. One could create a PPPredicateObjectParser subclass that is
special casing >>#/ to build a combined classification table.


2.) When blindly following a BNF enumeration of "A or B or C or D or E
or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value)
one pays a high cost in the backtracking and constructing the PPFailure for
each failed case.

In my SIPGrammar I have action parsers for To ==>.. From ==>  and would
like to keep that. At the same time I would be happy if the token in front of the
colon is only consumed once and then delegated to the right parser and if that
one failed use the ‘catch all’ one.

I don’t know which abstraction would be needed to allow creating optimized
PetitParsers for such grammars.

sorry for the long mail, long details and context is below.


kind regards
        holger






Full details:


1.) CharSetPredicate

| aParser bParser combinedParser aTime bTime cTime |

aParser := #digit asParser.
bParser := #letter asParser.
combinedParser := aParser / bParser.

aTime := [  aParser parse: 'b'] bench.
bTime := [ bParser parse: 'b'] bench.
cTime := [ combinedParser parse: 'b'] bench.
{ aTime. bTime. cTime }

cTime is bounded by the time execution time of of the slowest
of these parsers + overhead (e.g. PPFailure creation).

e.g.

#('559,000 per second.' '1,010,000 per second.' '429,000 per second.')

With a proof of concept PPPredicateCharSetParser

#('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’)

The noise is pretty string here but what is important is that bParser and the
combinedParser are in the same ballpark.

2.) Choice Parser



The BNF grammar of the parser is roughly:

Request        =  Request-Line
                  *( message-header )
                  CRLF
                  [ message-body ]

message-header  =  (Accept
                …
                / To
                / From
                / Via
                /  extension-header) CRLF

Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
Accept         =  "Accept" HCOLON
                   [ accept-range *(COMMA accept-range) ]


So there can be several lines of “message-header”. And each method header
starts with a token/word, a colon and then the parameter. “extension-header”
is kind of a catch all if no other rule matched. E.g. if a client sends a To which is
wrongly encoded it would end up with the extension-header.

I transferred the above naively to PetitParser and end up with something like
parsing ~500 messages a second. The main cost appears to come from the
choice parser that needs to create a PetitFailure all the time. E.g. if you have a
line like this:

        ‘From: “Holger Freyther” <sip:[hidden email]>’

The choice parser will start with the “Accept” rule, parse the token (“From” and
then create a PPFailure, then … rules, then “To”, parse the token.. So we have
parsing the same token more than once and creating PPFailures all the time. I
ended up creating a custom parser that will peek the token, have a horrible chain
of token = ‘XYZ’ ifTrue and then dispatch to the other rule.

It would be nice if PetitParser could be taught to only parse the token once and
then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the
token to match, the parser to continue with and a fallback parser?



[1]
http://smalltalkhub.com/#!/~osmocom/SIP
http://smalltalkhub.com/#!/~osmocom/MGCP
Reply | Threaded
Open this post in threaded view
|

Re: PetitParser speed and cost of PPFailure

Tudor Girba-2
You will probably have more chances to get a response on the Moose mailing list.

Cheers,
Doru

On Sun, Aug 16, 2015 at 5:58 PM, Holger Freyther <[hidden email]> wrote:
Hi,

once again I am not sure if this is the right list. The first parser I wrote using
PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the
code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench
to get an idea of the speed.


I have two performance “issues” and wonder if others hand similar issues with
PetitParser and if there is a general approach to this.



1.) Combining two PPCharsetPredicates does not combine the “classification”
table it had. One could create a PPPredicateObjectParser subclass that is
special casing >>#/ to build a combined classification table.


2.) When blindly following a BNF enumeration of "A or B or C or D or E
or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value)
one pays a high cost in the backtracking and constructing the PPFailure for
each failed case.

In my SIPGrammar I have action parsers for To ==>.. From ==>  and would
like to keep that. At the same time I would be happy if the token in front of the
colon is only consumed once and then delegated to the right parser and if that
one failed use the ‘catch all’ one.

I don’t know which abstraction would be needed to allow creating optimized
PetitParsers for such grammars.

sorry for the long mail, long details and context is below.


kind regards
        holger






Full details:


1.) CharSetPredicate

| aParser bParser combinedParser aTime bTime cTime |

aParser := #digit asParser.
bParser := #letter asParser.
combinedParser := aParser / bParser.

aTime := [  aParser parse: 'b'] bench.
bTime := [ bParser parse: 'b'] bench.
cTime := [ combinedParser parse: 'b'] bench.
{ aTime. bTime. cTime }

cTime is bounded by the time execution time of of the slowest
of these parsers + overhead (e.g. PPFailure creation).

e.g.

#('559,000 per second.' '1,010,000 per second.' '429,000 per second.')

With a proof of concept PPPredicateCharSetParser

#('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’)

The noise is pretty string here but what is important is that bParser and the
combinedParser are in the same ballpark.

2.) Choice Parser



The BNF grammar of the parser is roughly:

Request        =  Request-Line
                  *( message-header )
                  CRLF
                  [ message-body ]

message-header  =  (Accept
                …
                / To
                / From
                / Via
                /  extension-header) CRLF

Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
Accept         =  "Accept" HCOLON
                   [ accept-range *(COMMA accept-range) ]


So there can be several lines of “message-header”. And each method header
starts with a token/word, a colon and then the parameter. “extension-header”
is kind of a catch all if no other rule matched. E.g. if a client sends a To which is
wrongly encoded it would end up with the extension-header.

I transferred the above naively to PetitParser and end up with something like
parsing ~500 messages a second. The main cost appears to come from the
choice parser that needs to create a PetitFailure all the time. E.g. if you have a
line like this:

        ‘From: “Holger Freyther” <[hidden email]>’

The choice parser will start with the “Accept” rule, parse the token (“From” and
then create a PPFailure, then … rules, then “To”, parse the token.. So we have
parsing the same token more than once and creating PPFailures all the time. I
ended up creating a custom parser that will peek the token, have a horrible chain
of token = ‘XYZ’ ifTrue and then dispatch to the other rule.

It would be nice if PetitParser could be taught to only parse the token once and
then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the
token to match, the parser to continue with and a fallback parser?



[1]
http://smalltalkhub.com/#!/~osmocom/SIP
http://smalltalkhub.com/#!/~osmocom/MGCP



--

"Every thing has its own flow"
Reply | Threaded
Open this post in threaded view
|

Re: PetitParser speed and cost of PPFailure

stepharo
In reply to this post by Holger Freyther
Hi holger

At ESUG Jan showed that he is working on a kind of PetitParser compiler.

Le 16/8/15 17:58, Holger Freyther a écrit :

> Hi,
>
> once again I am not sure if this is the right list. The first parser I wrote using
> PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the
> code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench
> to get an idea of the speed.
>
>
> I have two performance “issues” and wonder if others hand similar issues with
> PetitParser and if there is a general approach to this.
>
>
> 1.) Combining two PPCharsetPredicates does not combine the “classification”
> table it had. One could create a PPPredicateObjectParser subclass that is
> special casing >>#/ to build a combined classification table.
>
> 2.) When blindly following a BNF enumeration of "A or B or C or D or E
> or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value)
> one pays a high cost in the backtracking and constructing the PPFailure for
> each failed case.
>
> In my SIPGrammar I have action parsers for To ==>.. From ==>  and would
> like to keep that. At the same time I would be happy if the token in front of the
> colon is only consumed once and then delegated to the right parser and if that
> one failed use the ‘catch all’ one.
>
> I don’t know which abstraction would be needed to allow creating optimized
> PetitParsers for such grammars.
>
> sorry for the long mail, long details and context is below.
What thierry will tell you is that if you do not need to compose your
grammars
and the grammar is not evolving and you need speed then a SmaCC
approache with LGR, LALR or something else
is probably faster.

Stef

>
>
> kind regards
> holger
>
>
>
>
>
>
> Full details:
>
>
> 1.) CharSetPredicate
>
> | aParser bParser combinedParser aTime bTime cTime |
>
> aParser := #digit asParser.
> bParser := #letter asParser.
> combinedParser := aParser / bParser.
>
> aTime := [  aParser parse: 'b'] bench.
> bTime := [ bParser parse: 'b'] bench.
> cTime := [ combinedParser parse: 'b'] bench.
> { aTime. bTime. cTime }
>
> cTime is bounded by the time execution time of of the slowest
> of these parsers + overhead (e.g. PPFailure creation).
>
> e.g.
>
> #('559,000 per second.' '1,010,000 per second.' '429,000 per second.')
>
> With a proof of concept PPPredicateCharSetParser
>
> #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’)
>
> The noise is pretty string here but what is important is that bParser and the
> combinedParser are in the same ballpark.
>
> 2.) Choice Parser
>
>
>
> The BNF grammar of the parser is roughly:
>
> Request        =  Request-Line
>                    *( message-header )
>                    CRLF
>                    [ message-body ]
>
> message-header  =  (Accept
> …
> / To
> / From
> / Via
> /  extension-header) CRLF
>
> Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
> Accept         =  "Accept" HCOLON
>                     [ accept-range *(COMMA accept-range) ]
>
>
> So there can be several lines of “message-header”. And each method header
> starts with a token/word, a colon and then the parameter. “extension-header”
> is kind of a catch all if no other rule matched. E.g. if a client sends a To which is
> wrongly encoded it would end up with the extension-header.
>
> I transferred the above naively to PetitParser and end up with something like
> parsing ~500 messages a second. The main cost appears to come from the
> choice parser that needs to create a PetitFailure all the time. E.g. if you have a
> line like this:
>
> ‘From: “Holger Freyther” <sip:[hidden email]>’
>
> The choice parser will start with the “Accept” rule, parse the token (“From” and
> then create a PPFailure, then … rules, then “To”, parse the token.. So we have
> parsing the same token more than once and creating PPFailures all the time. I
> ended up creating a custom parser that will peek the token, have a horrible chain
> of token = ‘XYZ’ ifTrue and then dispatch to the other rule.
>
> It would be nice if PetitParser could be taught to only parse the token once and
> then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the
> token to match, the parser to continue with and a fallback parser?
>
>
>
> [1]
> http://smalltalkhub.com/#!/~osmocom/SIP
> http://smalltalkhub.com/#!/~osmocom/MGCP
>


Reply | Threaded
Open this post in threaded view
|

Re: PetitParser speed and cost of PPFailure

Jan Vrany
In reply to this post by Holger Freyther
Hi Holger,

Jan and me are working on a PetitCompiler [1,2], a tool whose aim
is to compile PetitParsers to Smalltalk to gain performance and
allow for hand-tuning. In short, we try to generate a tokenizer
and LL(1) parser on top of it. In general case, given the flexibility
of PetitParser, this is rather tricky :-)

The results are - performance wise - not bad, IMO [3].
As long as your grammar is sort of LL(1) it should not backtrack
(but Jan would give you more details, this is his domain)

Now, the whole thing is work-in-progress and by no means
is - as of today - production-ready tool. I'd like to give it
a try and benchmark it as it would make nice case-study for us.
But - no promises time-wise. I'm bloody busy now and this is
a sort of spare time-project for me.

Best, Jan

[1] https://bitbucket.org/janvrany/stx-goodies-petitparser
[2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/
[3] http://bit.ly/1IXnz0c 


On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote:

> Hi,
>
> once again I am not sure if this is the right list. The first parser
> I wrote using
> PetitParser was a SIP (and then MGCP) parser. I have recently
> ported[1] the
> code to Pharo and with Pharo it is very tempting to Use
> BlockClosure>>#bench
> to get an idea of the speed.
>
>
> I have two performance “issues” and wonder if others hand similar
> issues with
> PetitParser and if there is a general approach to this.
>
>
>
> 1.) Combining two PPCharsetPredicates does not combine the
> “classification”
> table it had. One could create a PPPredicateObjectParser subclass
> that is
> special casing >>#/ to build a combined classification table.
>
>
> 2.) When blindly following a BNF enumeration of "A or B or C or D or
> E
> or CatchAll” and each “A, B” follow common pattern (e.g. token COLON
> value)
> one pays a high cost in the backtracking and constructing the
> PPFailure for
> each failed case.
>
> In my SIPGrammar I have action parsers for To ==>.. From ==>  and
> would
> like to keep that. At the same time I would be happy if the token in
> front of the
> colon is only consumed once and then delegated to the right parser
> and if that
> one failed use the ‘catch all’ one.
>
> I don’t know which abstraction would be needed to allow creating
> optimized
> PetitParsers for such grammars.
>
> sorry for the long mail, long details and context is below.
>
>
> kind regards
> holger
>
>
>
>
>
>
> Full details:
>
>
> 1.) CharSetPredicate
>
> > aParser bParser combinedParser aTime bTime cTime |
>
> aParser := #digit asParser.
> bParser := #letter asParser.
> combinedParser := aParser / bParser.
>
> aTime := [  aParser parse: 'b'] bench.
> bTime := [ bParser parse: 'b'] bench.
> cTime := [ combinedParser parse: 'b'] bench.
> { aTime. bTime. cTime }
>
> cTime is bounded by the time execution time of of the slowest
> of these parsers + overhead (e.g. PPFailure creation).
>
> e.g.
>
> #('559,000 per second.' '1,010,000 per second.' '429,000 per
> second.')
>
> With a proof of concept PPPredicateCharSetParser
>
> #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per
> second.’)
>
> The noise is pretty string here but what is important is that bParser
> and the
> combinedParser are in the same ballpark.
>
> 2.) Choice Parser
>
>
>
> The BNF grammar of the parser is roughly:
>
> Request        =  Request-Line
>                   *( message-header )
>                   CRLF
>                   [ message-body ]
>
> message-header  =  (Accept
> …
> / To
> / From
> / Via
> /  extension-header) CRLF
>
> Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
> Accept         =  "Accept" HCOLON
>                    [ accept-range *(COMMA accept-range) ]
>
>
> So there can be several lines of “message-header”. And each method
> header
> starts with a token/word, a colon and then the parameter. “extension
> -header”
> is kind of a catch all if no other rule matched. E.g. if a client
> sends a To which is
> wrongly encoded it would end up with the extension-header.
>
> I transferred the above naively to PetitParser and end up with
> something like
> parsing ~500 messages a second. The main cost appears to come from
> the
> choice parser that needs to create a PetitFailure all the time. E.g.
> if you have a
> line like this:
>
> ‘From: “Holger Freyther” <sip:[hidden email]>’
>
> The choice parser will start with the “Accept” rule, parse the token
> (“From” and
> then create a PPFailure, then … rules, then “To”, parse the token..
> So we have
> parsing the same token more than once and creating PPFailures all the
> time. I
> ended up creating a custom parser that will peek the token, have a
> horrible chain
> of token = ‘XYZ’ ifTrue and then dispatch to the other rule.
>
> It would be nice if PetitParser could be taught to only parse the
> token once and
> then delegate to the param rule. E.g. a PPAnyOfParser that allows to
> specify the
> token to match, the parser to continue with and a fallback parser?
>
>
>
> [1]
> http://smalltalkhub.com/#!/~osmocom/SIP
> http://smalltalkhub.com/#!/~osmocom/MGCP
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitParser speed and cost of PPFailure

kurs.jan


Hi Holger,

Thank you for bringing this.

In the first case, I suggest to create a specialised parser, that fills in the classification as you desire.

In the second case, the left refactoring of the common prefix is the way to go.

As you are speaking about general solution,
there is a PetitParser rewriter, that can do what you mentioned. Unfortunately I cannot provide more details as I never really got into this.

The other approach, as the first Jan already mentioned, is a compiler that takes PetitParser as an input and produces a class representing a parser that recognizes equivalent language and that should resemble hand-written top-down parser.

During the compilation, we do many optimizations and I believe we can get decent performance.

As the compiler is not production ready and as we are looking for a use cases, I can import your grammar and let you know how to make it working with our tool and what would be the performance gain. In that case, I would need some test cases and benchmark inputs from you.

Cheers Jan


On Mon, 17 Aug 2015 08:30 Jan Vrany <[hidden email]> wrote:
Hi Holger,

Jan and me are working on a PetitCompiler [1,2], a tool whose aim
is to compile PetitParsers to Smalltalk to gain performance and
allow for hand-tuning. In short, we try to generate a tokenizer
and LL(1) parser on top of it. In general case, given the flexibility
of PetitParser, this is rather tricky :-)

The results are - performance wise - not bad, IMO [3].
As long as your grammar is sort of LL(1) it should not backtrack
(but Jan would give you more details, this is his domain)

Now, the whole thing is work-in-progress and by no means
is - as of today - production-ready tool. I'd like to give it
a try and benchmark it as it would make nice case-study for us.
But - no promises time-wise. I'm bloody busy now and this is
a sort of spare time-project for me.

Best, Jan

[1] https://bitbucket.org/janvrany/stx-goodies-petitparser
[2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/
[3] http://bit.ly/1IXnz0c


On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote:
> Hi,
>
> once again I am not sure if this is the right list. The first parser
> I wrote using
> PetitParser was a SIP (and then MGCP) parser. I have recently
> ported[1] the
> code to Pharo and with Pharo it is very tempting to Use
> BlockClosure>>#bench
> to get an idea of the speed.
>
>
> I have two performance “issues” and wonder if others hand similar
> issues with
> PetitParser and if there is a general approach to this.
>
>
>
> 1.) Combining two PPCharsetPredicates does not combine the
> “classification”
> table it had. One could create a PPPredicateObjectParser subclass
> that is
> special casing >>#/ to build a combined classification table.
>
>
> 2.) When blindly following a BNF enumeration of "A or B or C or D or
> E
> or CatchAll” and each “A, B” follow common pattern (e.g. token COLON
> value)
> one pays a high cost in the backtracking and constructing the
> PPFailure for
> each failed case.
>
> In my SIPGrammar I have action parsers for To ==>.. From ==>  and
> would
> like to keep that. At the same time I would be happy if the token in
> front of the
> colon is only consumed once and then delegated to the right parser
> and if that
> one failed use the ‘catch all’ one.
>
> I don’t know which abstraction would be needed to allow creating
> optimized
> PetitParsers for such grammars.
>
> sorry for the long mail, long details and context is below.
>
>
> kind regards
>       holger
>
>
>
>
>
>
> Full details:
>
>
> 1.) CharSetPredicate
>
> > aParser bParser combinedParser aTime bTime cTime |
>
> aParser := #digit asParser.
> bParser := #letter asParser.
> combinedParser := aParser / bParser.
>
> aTime := [  aParser parse: 'b'] bench.
> bTime := [ bParser parse: 'b'] bench.
> cTime := [ combinedParser parse: 'b'] bench.
> { aTime. bTime. cTime }
>
> cTime is bounded by the time execution time of of the slowest
> of these parsers + overhead (e.g. PPFailure creation).
>
> e.g.
>
> #('559,000 per second.' '1,010,000 per second.' '429,000 per
> second.')
>
> With a proof of concept PPPredicateCharSetParser
>
> #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per
> second.’)
>
> The noise is pretty string here but what is important is that bParser
> and the
> combinedParser are in the same ballpark.
>
> 2.) Choice Parser
>
>
>
> The BNF grammar of the parser is roughly:
>
> Request        =  Request-Line
>                   *( message-header )
>                   CRLF
>                   [ message-body ]
>
> message-header  =  (Accept
>               …
>               / To
>               / From
>               / Via
>               /  extension-header) CRLF
>
> Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
> Accept         =  "Accept" HCOLON
>                    [ accept-range *(COMMA accept-range) ]
>
>
> So there can be several lines of “message-header”. And each method
> header
> starts with a token/word, a colon and then the parameter. “extension
> -header”
> is kind of a catch all if no other rule matched. E.g. if a client
> sends a To which is
> wrongly encoded it would end up with the extension-header.
>
> I transferred the above naively to PetitParser and end up with
> something like
> parsing ~500 messages a second. The main cost appears to come from
> the
> choice parser that needs to create a PetitFailure all the time. E.g.
> if you have a
> line like this:
>
>       ‘From: “Holger Freyther” <[hidden email]>’
>
> The choice parser will start with the “Accept” rule, parse the token
> (“From” and
> then create a PPFailure, then … rules, then “To”, parse the token..
> So we have
> parsing the same token more than once and creating PPFailures all the
> time. I
> ended up creating a custom parser that will peek the token, have a
> horrible chain
> of token = ‘XYZ’ ifTrue and then dispatch to the other rule.
>
> It would be nice if PetitParser could be taught to only parse the
> token once and
> then delegate to the param rule. E.g. a PPAnyOfParser that allows to
> specify the
> token to match, the parser to continue with and a fallback parser?
>
>
>
> [1]
> http://smalltalkhub.com/#!/~osmocom/SIP
> http://smalltalkhub.com/#!/~osmocom/MGCP
>
Reply | Threaded
Open this post in threaded view
|

Re: PetitParser speed and cost of PPFailure

Holger Freyther

> On 17 Aug 2015, at 13:03, Jan Kurš <[hidden email]> wrote:
>
>

Dear Jan,


> As the compiler is not production ready and as we are looking for a use cases, I can import your grammar and let you know how to make it working with our tool and what would be the performance gain. In that case, I would need some test cases and benchmark inputs from you.

sorry for the delay. The parser/tests are contained in the OsmoSIP code and
there is a metacello configuration[1] as well. The code does load in a Pharo3.0
image (and it should work in Pharo4.0 as well).

As for benchmarks the closes would be to parse specific messages. E.g. three
of them of the SIPParserTest,
       
        >>#authorizationData,
        >>#resultUnauthorized
        >>#statusResponseData
        class >>#exampleYateBye

Not a good test corpus (1 request, three responses) but it should give you a
ballpark figure if things go more quick. You might want to undo my hack I did
in commit #18 [2].

kind regards
        holger


[1] http://smalltalkhub.com/#!/~osmocom/SIP/source
[2] http://smalltalkhub.com/#!/~osmocom/SIP/diff/OsmoSIP-HolgerHansPeterFreyther.19