SmaCC parsing "rest until"

Manuel Leuenberger
Hi,

I am currently struggling with understanding why my SmaCC parser refuses to parse a rather simple language. Grammar and example are below.
The language consists of multiple leak reports, separated by two newlines. Each leak report has a type, an object count, and a byte size on the first line. Then comes a stack of indented frames with addresses, plus method and source location if available. Somehow my parser always fails because it is trying to scan the whole string as <method>. I have been struggling with this for hours without making any progress.

How do I need to rewrite the grammar so that the scanner produces the right tokens for the parser?

I understand that the use of the <method> token in Source is problematic, because it needs to match everything up to the last word. But I cannot find another way to write it, or a way to tweak the parser attributes, so that the tokenization comes out right.

Does anybody have some hints? I would greatly appreciate it.

Cheers,
Manuel


Example:

LeReParser parse: 'Indirect leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x106be248c in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5d48c)
    #1 0x12193989d in moz_xmalloc mozalloc.cpp:83
    #2 0x11fc8200d in std::__1::vector<unsigned int, std::__1::allocator<unsigned int> >::__append(unsigned long) mozalloc.h:194
    #3 0x11fc80d5a in mozilla::gfx::GradientStopsSkia::GradientStopsSkia(std::__1::vector<mozilla::gfx::GradientStop, std::__1::allocator<mozilla::gfx::GradientStop> > const&, unsigned int, mozilla::gfx::ExtendMode) vector:2041
    #4 0x11fc6f1dc in mozilla::gfx::DrawTargetSkia::CreateGradientStops(mozilla::gfx::GradientStop*, unsigned int, mozilla::gfx::ExtendMode) const DrawTargetSkia.cpp:62
    #5 0x122236b1e in create_gradient_stops(mozilla::gfx::DrawTarget*, float*, unsigned int, mozilla::gfx::ExtendMode) pattern.cpp:98
    #6 0x122236323 in moz2d_pattern_linear_gradient_create_flat pattern.cpp:40
    #7 0x1069cf901 in primitiveCalloutWithArgs (Pharo:x86_64+0x1003a5901)
    #8 0x10669e222 in primitiveExternalCall gcc3x-cointerp.c:76887
    #9 0x106694aed in executeNewMethod gcc3x-cointerp.c:22341
    #10 0x10669a403 in ceSendsupertonumArgs gcc3x-cointerp.c:16540
    #11 0x11043c134  (<unknown module>)
    #12 0x10662d712 in interpret gcc3x-cointerp.c:2754
    #13 0x1068b2ada in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201
    #14 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc)
    #15 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56)
    #16 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6)
    #17 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8)
    #18 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033)
    #19 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb)
    #20 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8)
    #21 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25)
    #22 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53)
    #23 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed)
    #24 0x7fffc539a3da in -[NSApplication run] (AppKit:x86_64+0x3b3da)
    #25 0x7fffc5364e0d in NSApplicationMain (AppKit:x86_64+0x5e0d)
    #26 0x7fffdd0a2234 in start (libdyld.dylib:x86_64+0x5234)

'.

Grammar:

%prefix LeRe;
%suffix Node;
%root Report;

<newline>
        : \r \n? | \n
        ;
<number>
        : \d+
        ;
<index>
        : \# \d+
        ;
<address>
        : 0 x [0-9a-f]+
        ;
<method>
        : [^\r\n]+
        ;
<file>
        : \S+
        ;

Report
        : Leaks 'leaks' { leaks }
        ;
Leaks
        : Leak 'leak' { { leak } }
        | Leaks 'leaks' Leak 'leak' { leaks , { leak } }
        ;
Leak
        : LeakType 'type' <number> 'bytes' " byte(s) in " <number> 'objects' " object(s) allocated from:" <newline> Stack 'stack' <newline> { { type . bytes value . objects value . stack } }
        ;
LeakType
        : "Indirect leak of " { #indirect }
        | "Direct leak of " { #direct }
        ;
Stack
        : Frames 'frames' { frames }
        ;
Frames
        : Frame 'frame' { { frame } }
        | Frames 'frames' Frame 'frame' { frames , { frame } }
        ;
Frame
        : "    " <index> " " <address> " " Source 'source' <newline> { source }
        ;
Source
        : " (<unknown module>)" { #unknown }
        | "in " <method> 'method' " " <file> 'file' { { method value . file value } }
        ;

Re: SmaCC parsing "rest until"

Manuel Leuenberger
Hi Thierry,

Thanks for the ideas and references. I also tried parsing with PetitParser2, because with PEGs I can have unlimited lookahead. Still, I did not get it working quickly, and it was ugly anyway. Another option would be to tokenize the whole remainder of the line into words and, in a second step, rebuild the method from everything except the last word. Ultimately, it was easier to write a hand-written parser that does exactly that (https://github.com/maenu/leak-reporter/blob/master/src/pharo/LeakReporter/LeReParser.class.st); the structure is simple enough.
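The "split off the last word" idea can be sketched in a few lines. The snippet below is a language-neutral illustration in Python, not a transcription of the linked LeReParser; the names `FRAME` and `parse_frame` and the returned tuple shape are assumptions made for this example.

```python
import re

# Hypothetical helper: "#<index> <address> <rest of line>" for one frame line.
FRAME = re.compile(r"^\s*#(\d+) (0x[0-9a-f]+) (.*)$")

def parse_frame(line):
    """Return (index, address, method, location) for one stack frame line."""
    m = FRAME.match(line)
    if m is None:
        raise ValueError(f"not a frame line: {line!r}")
    index, address, rest = int(m.group(1)), m.group(2), m.group(3).strip()
    if not rest.startswith("in "):
        # Frames such as "(<unknown module>)" carry no method or location.
        return index, address, None, None
    # Tokenize the remainder into words; the last word is the location,
    # everything before it is glued back together as the method.
    words = rest[len("in "):].split(" ")
    return index, address, " ".join(words[:-1]), words[-1]

frame = parse_frame("    #1 0x12193989d in moz_xmalloc mozalloc.cpp:83")
print(frame)
```

Because the split happens from the right, method names that contain spaces (template arguments, a trailing `const`) stay intact without any lookahead in a scanner.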

Cheers,
Manuel

On 30 Jan 2019, at 19:19, Thierry Goubier <[hidden email]> wrote:

On 29/01/2019 at 22:17, Manuel Leuenberger wrote:
Hi,
I am currently struggling with understanding why my SmaCC parser refuses to parse a rather simple language. Grammar and example are below.
The language consists of multiple leak reports, separated by two newlines. Each leak report has a type, an object count, and a byte size on the first line. Then comes a stack of indented frames with addresses, plus method and source location if available. Somehow my parser always fails because it is trying to scan the whole string as <method>. I have been struggling with this for hours without making any progress.
How do I need to rewrite the grammar so that the scanner produces the right tokens for the parser?
I understand that the use of the <method> token in Source is problematic, because it needs to match everything up to the last word. But I cannot find another way to write it, or a way to tweak the parser attributes, so that the tokenization comes out right.

Hi Manuel,

The problem is indeed your <method> token, which is eating nearly everything.
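To see why <method> wins, here is a small Python sketch of a longest-match scanner. The token patterns are transcribed from the grammar in this thread; the scanning loop is a simplified stand-in and only assumes that the scanner prefers the longest match, as DFA-based scanners usually do. It is not SmaCC's actual implementation.

```python
import re

# Token patterns transcribed from the grammar; the dict and function
# names are hypothetical and exist only for this illustration.
TOKENS = {
    "number": r"\d+",
    "index": r"#\d+",
    "address": r"0x[0-9a-f]+",
    "method": r"[^\r\n]+",
    "file": r"\S+",
}

def longest_match(text, pos=0):
    """Return (token name, lexeme) for the longest pattern match at pos."""
    best = None
    for name, pattern in TOKENS.items():
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

line = "#1 0x12193989d in moz_xmalloc mozalloc.cpp:83\n"
name, lexeme = longest_match(line)
print(name, repr(lexeme))
```

Since `[^\r\n]+` matches any run of characters up to the end of the line, it is at least as long as every other candidate at every position, so the other tokens never get a chance.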

In such a situation, you could use a state in your scanner, limit <method> to that state, and switch in and out of that state in the parser actions.

Something like:

inMethod <method> :  ;

in: "in" {self state: #inMethod};

method: <method> {self state: #default};

See section 3.6 in the SmaCC booklet.

Does anybody have some hints? I would greatly appreciate it.

The scanner state may help, but I think your <method> tag has to be better defined, else you will be unable to get the <file> part.
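Both halves of that remark can be illustrated with a toy two-state scanner in Python. The state names follow the reply above and the patterns come from the grammar; the `RULES` table, the `scan` function, and the way states are switched are assumptions for illustration and do not reflect SmaCC's API.

```python
import re

# Hypothetical rule table: which tokens are active in which scanner state.
RULES = {
    "default": [("index", r"#\d+"), ("address", r"0x[0-9a-f]+"),
                ("in", r"in "), ("space", r" +")],
    "inMethod": [("method", r"[^\r\n]+")],
}

def scan(line):
    """Tokenize one frame line, switching state around the method token."""
    state, pos, tokens = "default", 0, []
    while pos < len(line):
        # Longest match among the rules active in the current state.
        candidates = [(name, m.group()) for name, pat in RULES[state]
                      if (m := re.compile(pat).match(line, pos))]
        if not candidates:
            raise ValueError(f"no token at {pos} in state {state}")
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        pos += len(lexeme)
        if name != "space":
            tokens.append((name, lexeme))
        if name == "in":
            state = "inMethod"   # <method> is only recognized from here on
        elif name == "method":
            state = "default"
    return tokens

tokens = scan("#1 0x12193989d in moz_xmalloc mozalloc.cpp:83")
print(tokens)
```

The state keeps <method> from swallowing the index and the address, but the method lexeme still runs to the end of the line, so no <file> token is ever produced. That is why the pattern itself needs a tighter definition.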

Source
: " (<unknown module>)" { #unknown }
| "in " <method> 'method' " " <file> 'file' { { method value . file value } }
;

Avoid having spaces inside the terminals (e.g. "in "); also, the " " after <method> probably won't match.

All I can suggest is to write a <method> token (or a method nonterminal) that is terminated by a space and allows spaces only inside enclosing parentheses "(" ")" or angle brackets "<" ">"; whatever follows that space is the <file>.
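The rule suggested here can be sketched with a nesting counter. This is a Python illustration of the idea rather than a SmaCC grammar; `split_method_file` is a hypothetical helper, and its input is the part of a frame line that follows "in ".

```python
OPEN, CLOSE = "(<", ")>"

def split_method_file(source):
    """Split at the first space that is outside any () or <> pair."""
    depth = 0
    for i, ch in enumerate(source):
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE:
            depth -= 1
        elif ch == " " and depth == 0:
            # First top-level space: method on the left, file on the right.
            return source[:i], source[i + 1:]
    raise ValueError("no top-level space found")

method, file = split_method_file(
    "std::__1::vector<unsigned int, std::__1::allocator<unsigned int> >"
    "::__append(unsigned long) mozalloc.h:194"
)
print(method)
print(file)
```

On the example input this rule does not cover every frame: frame #4 has a trailing `const` after the closing parenthesis, and frame #13 uses Objective-C square brackets with a space inside, so both would need extra handling.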

Hope this helps,

Thierry
