Hi,
once again I am not sure if this is the right list. The first parser I wrote using PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench to get an idea of the speed. I have two performance “issues” and wonder if others hand similar issues with PetitParser and if there is a general approach to this. 1.) Combining two PPCharsetPredicates does not combine the “classification” table it had. One could create a PPPredicateObjectParser subclass that is special casing >>#/ to build a combined classification table. 2.) When blindly following a BNF enumeration of "A or B or C or D or E or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value) one pays a high cost in the backtracking and constructing the PPFailure for each failed case. In my SIPGrammar I have action parsers for To ==>.. From ==> and would like to keep that. At the same time I would be happy if the token in front of the colon is only consumed once and then delegated to the right parser and if that one failed use the ‘catch all’ one. I don’t know which abstraction would be needed to allow creating optimized PetitParsers for such grammars. sorry for the long mail, long details and context is below. kind regards holger Full details: 1.) CharSetPredicate | aParser bParser combinedParser aTime bTime cTime | aParser := #digit asParser. bParser := #letter asParser. combinedParser := aParser / bParser. aTime := [ aParser parse: 'b'] bench. bTime := [ bParser parse: 'b'] bench. cTime := [ combinedParser parse: 'b'] bench. { aTime. bTime. cTime } cTime is bounded by the time execution time of of the slowest of these parsers + overhead (e.g. PPFailure creation). e.g. #('559,000 per second.' '1,010,000 per second.' '429,000 per second.') With a proof of concept PPPredicateCharSetParser #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’) The noise is pretty string here but what is important is that bParser and the combinedParser are in the same ballpark. 2.) Choice Parser The BNF grammar of the parser is roughly: Request = Request-Line *( message-header ) CRLF [ message-body ] message-header = (Accept … / To / From / Via / extension-header) CRLF Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param) Accept = "Accept" HCOLON [ accept-range *(COMMA accept-range) ] So there can be several lines of “message-header”. And each method header starts with a token/word, a colon and then the parameter. “extension-header” is kind of a catch all if no other rule matched. E.g. if a client sends a To which is wrongly encoded it would end up with the extension-header. I transferred the above naively to PetitParser and end up with something like parsing ~500 messages a second. The main cost appears to come from the choice parser that needs to create a PetitFailure all the time. E.g. if you have a line like this: ‘From: “Holger Freyther” <sip:[hidden email]>’ The choice parser will start with the “Accept” rule, parse the token (“From” and then create a PPFailure, then … rules, then “To”, parse the token.. So we have parsing the same token more than once and creating PPFailures all the time. I ended up creating a custom parser that will peek the token, have a horrible chain of token = ‘XYZ’ ifTrue and then dispatch to the other rule. It would be nice if PetitParser could be taught to only parse the token once and then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the token to match, the parser to continue with and a fallback parser? [1] http://smalltalkhub.com/#!/~osmocom/SIP http://smalltalkhub.com/#!/~osmocom/MGCP |
You will probably have more chances to get a response on the Moose mailing list. Cheers, Doru On Sun, Aug 16, 2015 at 5:58 PM, Holger Freyther <[hidden email]> wrote: Hi, |
In reply to this post by Holger Freyther
Hi holger
At ESUG Jan showed that he is working on a kind of PetitParser compiler. Le 16/8/15 17:58, Holger Freyther a écrit : > Hi, > > once again I am not sure if this is the right list. The first parser I wrote using > PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the > code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench > to get an idea of the speed. > > > I have two performance “issues” and wonder if others hand similar issues with > PetitParser and if there is a general approach to this. > > > 1.) Combining two PPCharsetPredicates does not combine the “classification” > table it had. One could create a PPPredicateObjectParser subclass that is > special casing >>#/ to build a combined classification table. > > 2.) When blindly following a BNF enumeration of "A or B or C or D or E > or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value) > one pays a high cost in the backtracking and constructing the PPFailure for > each failed case. > > In my SIPGrammar I have action parsers for To ==>.. From ==> and would > like to keep that. At the same time I would be happy if the token in front of the > colon is only consumed once and then delegated to the right parser and if that > one failed use the ‘catch all’ one. > > I don’t know which abstraction would be needed to allow creating optimized > PetitParsers for such grammars. > > sorry for the long mail, long details and context is below. grammars and the grammar is not evolving and you need speed then a SmaCC approache with LGR, LALR or something else is probably faster. Stef > > > kind regards > holger > > > > > > > Full details: > > > 1.) CharSetPredicate > > | aParser bParser combinedParser aTime bTime cTime | > > aParser := #digit asParser. > bParser := #letter asParser. > combinedParser := aParser / bParser. > > aTime := [ aParser parse: 'b'] bench. > bTime := [ bParser parse: 'b'] bench. > cTime := [ combinedParser parse: 'b'] bench. > { aTime. bTime. cTime } > > cTime is bounded by the time execution time of of the slowest > of these parsers + overhead (e.g. PPFailure creation). > > e.g. > > #('559,000 per second.' '1,010,000 per second.' '429,000 per second.') > > With a proof of concept PPPredicateCharSetParser > > #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’) > > The noise is pretty string here but what is important is that bParser and the > combinedParser are in the same ballpark. > > 2.) Choice Parser > > > > The BNF grammar of the parser is roughly: > > Request = Request-Line > *( message-header ) > CRLF > [ message-body ] > > message-header = (Accept > … > / To > / From > / Via > / extension-header) CRLF > > Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param) > Accept = "Accept" HCOLON > [ accept-range *(COMMA accept-range) ] > > > So there can be several lines of “message-header”. And each method header > starts with a token/word, a colon and then the parameter. “extension-header” > is kind of a catch all if no other rule matched. E.g. if a client sends a To which is > wrongly encoded it would end up with the extension-header. > > I transferred the above naively to PetitParser and end up with something like > parsing ~500 messages a second. The main cost appears to come from the > choice parser that needs to create a PetitFailure all the time. E.g. if you have a > line like this: > > ‘From: “Holger Freyther” <sip:[hidden email]>’ > > The choice parser will start with the “Accept” rule, parse the token (“From” and > then create a PPFailure, then … rules, then “To”, parse the token.. So we have > parsing the same token more than once and creating PPFailures all the time. I > ended up creating a custom parser that will peek the token, have a horrible chain > of token = ‘XYZ’ ifTrue and then dispatch to the other rule. > > It would be nice if PetitParser could be taught to only parse the token once and > then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the > token to match, the parser to continue with and a fallback parser? > > > > [1] > http://smalltalkhub.com/#!/~osmocom/SIP > http://smalltalkhub.com/#!/~osmocom/MGCP > |
In reply to this post by Holger Freyther
Hi Holger,
Jan and me are working on a PetitCompiler [1,2], a tool whose aim is to compile PetitParsers to Smalltalk to gain performance and allow for hand-tuning. In short, we try to generate a tokenizer and LL(1) parser on top of it. In general case, given the flexibility of PetitParser, this is rather tricky :-) The results are - performance wise - not bad, IMO [3]. As long as your grammar is sort of LL(1) it should not backtrack (but Jan would give you more details, this is his domain) Now, the whole thing is work-in-progress and by no means is - as of today - production-ready tool. I'd like to give it a try and benchmark it as it would make nice case-study for us. But - no promises time-wise. I'm bloody busy now and this is a sort of spare time-project for me. Best, Jan [1] https://bitbucket.org/janvrany/stx-goodies-petitparser [2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/ [3] http://bit.ly/1IXnz0c On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote: > Hi, > > once again I am not sure if this is the right list. The first parser > I wrote using > PetitParser was a SIP (and then MGCP) parser. I have recently > ported[1] the > code to Pharo and with Pharo it is very tempting to Use > BlockClosure>>#bench > to get an idea of the speed. > > > I have two performance “issues” and wonder if others hand similar > issues with > PetitParser and if there is a general approach to this. > > > > 1.) Combining two PPCharsetPredicates does not combine the > “classification” > table it had. One could create a PPPredicateObjectParser subclass > that is > special casing >>#/ to build a combined classification table. > > > 2.) When blindly following a BNF enumeration of "A or B or C or D or > E > or CatchAll” and each “A, B” follow common pattern (e.g. token COLON > value) > one pays a high cost in the backtracking and constructing the > PPFailure for > each failed case. > > In my SIPGrammar I have action parsers for To ==>.. From ==> and > would > like to keep that. At the same time I would be happy if the token in > front of the > colon is only consumed once and then delegated to the right parser > and if that > one failed use the ‘catch all’ one. > > I don’t know which abstraction would be needed to allow creating > optimized > PetitParsers for such grammars. > > sorry for the long mail, long details and context is below. > > > kind regards > holger > > > > > > > Full details: > > > 1.) CharSetPredicate > > > aParser bParser combinedParser aTime bTime cTime | > > aParser := #digit asParser. > bParser := #letter asParser. > combinedParser := aParser / bParser. > > aTime := [ aParser parse: 'b'] bench. > bTime := [ bParser parse: 'b'] bench. > cTime := [ combinedParser parse: 'b'] bench. > { aTime. bTime. cTime } > > cTime is bounded by the time execution time of of the slowest > of these parsers + overhead (e.g. PPFailure creation). > > e.g. > > #('559,000 per second.' '1,010,000 per second.' '429,000 per > second.') > > With a proof of concept PPPredicateCharSetParser > > #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per > second.’) > > The noise is pretty string here but what is important is that bParser > and the > combinedParser are in the same ballpark. > > 2.) Choice Parser > > > > The BNF grammar of the parser is roughly: > > Request = Request-Line > *( message-header ) > CRLF > [ message-body ] > > message-header = (Accept > … > / To > / From > / Via > / extension-header) CRLF > > Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param) > Accept = "Accept" HCOLON > [ accept-range *(COMMA accept-range) ] > > > So there can be several lines of “message-header”. And each method > header > starts with a token/word, a colon and then the parameter. “extension > -header” > is kind of a catch all if no other rule matched. E.g. if a client > sends a To which is > wrongly encoded it would end up with the extension-header. > > I transferred the above naively to PetitParser and end up with > something like > parsing ~500 messages a second. The main cost appears to come from > the > choice parser that needs to create a PetitFailure all the time. E.g. > if you have a > line like this: > > ‘From: “Holger Freyther” <sip:[hidden email]>’ > > The choice parser will start with the “Accept” rule, parse the token > (“From” and > then create a PPFailure, then … rules, then “To”, parse the token.. > So we have > parsing the same token more than once and creating PPFailures all the > time. I > ended up creating a custom parser that will peek the token, have a > horrible chain > of token = ‘XYZ’ ifTrue and then dispatch to the other rule. > > It would be nice if PetitParser could be taught to only parse the > token once and > then delegate to the param rule. E.g. a PPAnyOfParser that allows to > specify the > token to match, the parser to continue with and a fallback parser? > > > > [1] > http://smalltalkhub.com/#!/~osmocom/SIP > http://smalltalkhub.com/#!/~osmocom/MGCP > |
Thank you for bringing this. In the first case, I suggest to create a specialised parser, that fills in the classification as you desire. In the second case, the left refactoring of the common prefix is the way to go. As you are speaking about general solution, The other approach, as the first Jan already mentioned, is a compiler that takes PetitParser as an input and produces a class representing a parser that recognizes equivalent language and that should resemble hand-written top-down parser. During the compilation, we do many optimizations and I believe we can get decent performance. As the compiler is not production ready and as we are looking for a use cases, I can import your grammar and let you know how to make it working with our tool and what would be the performance gain. In that case, I would need some test cases and benchmark inputs from you. Cheers Jan On Mon, 17 Aug 2015 08:30 Jan Vrany <[hidden email]> wrote: Hi Holger, |
> On 17 Aug 2015, at 13:03, Jan Kurš <[hidden email]> wrote: > > Dear Jan, > As the compiler is not production ready and as we are looking for a use cases, I can import your grammar and let you know how to make it working with our tool and what would be the performance gain. In that case, I would need some test cases and benchmark inputs from you. sorry for the delay. The parser/tests are contained in the OsmoSIP code and there is a metacello configuration[1] as well. The code does load in a Pharo3.0 image (and it should work in Pharo4.0 as well). As for benchmarks the closes would be to parse specific messages. E.g. three of them of the SIPParserTest, >>#authorizationData, >>#resultUnauthorized >>#statusResponseData class >>#exampleYateBye Not a good test corpus (1 request, three responses) but it should give you a ballpark figure if things go more quick. You might want to undo my hack I did in commit #18 [2]. kind regards holger [1] http://smalltalkhub.com/#!/~osmocom/SIP/source [2] http://smalltalkhub.com/#!/~osmocom/SIP/diff/OsmoSIP-HolgerHansPeterFreyther.19 |
Free forum by Nabble | Edit this page |