Hi,
I'd like to define a parser for the following grammar:

Test: Noise "foo" Noise;
Noise: <noise>;
noise: .*;

I'd expect this to parse "__foo" successfully, but in fact it does not parse anything successfully. However, changing the definition to:

Test: Noise* "foo" Noise*;
Noise: <noise>
noise: .;

does work. Can anybody explain what makes the difference between these two versions, which I expected to be equivalent?

THX,
Steffen
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Steffen,
Sure. The difference is in the grammar expectations. In the first case,

Test: Noise "foo" Noise

the grammar says "Find one occurrence of Noise, followed by the literal 'foo', followed by another occurrence of Noise." Thus, the input string "__foo" doesn't match the definitions provided because there is nothing that corresponds to a Noise non-terminal. Even if you had a Noise before and after, I'm not sure it would parse because it wouldn't be able to handle the "__".

The second case:

Test: Noise* "foo" Noise*

states, "Find zero or more occurrences of Noise, followed by the literal 'foo', followed by zero or more occurrences of Noise." In this case, it should parse the input string "foo", but I'm not sure it would or should parse "__foo" (again, those "__").

I imagine your grammar is abbreviated, but the definition of "noise" would eat any and all '>' because '.' doesn't differentiate. Thus, the instant you add any '<', all remaining input would become part of "noise". Should it really be "[^>]*"?

Cheers!

Tom Hawker
--------------------------
Senior Framework Developer
--------------------------
Home +1 (408) 274-4128
Office +1 (408) 576-6591
Mobile +1 (408) 835-3643
In reply to this post by Steffen Märcker
Steffen Märcker wrote:
> I'd like to define a parser for the following grammer:
>
> Test: Noise "foo" Noise;
> Noise: <noise>;
> noise: .*;
>
> I'd expect that this successfully parses "__foo", but in fact it does not
> parse anything successfully.
> However, changing the definition to:
>
> Test: Noise* "foo" Noise*;
> Noise: <noise>
> noise: .;
>
> does work. Can anybody what makes the difference between these two
> versions, which I've expected to be equal?

The scanner is greedy and matches the longest string possible. Therefore, in your first example, the scanner creates a single <noise> token: "__foo". The parser fails since you don't have a "foo" token. In the second example, your <noise> token is only a single character so the scanner creates two <noise> "_" tokens followed by a "foo" token.

If you want to use the first example, you'll need to either update your <noise> regex ".*" to not include the "foo" keyword or you'll need to write a method on your scanner class to handle this.

John Brant
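The longest-match behaviour John describes can be tried out in a few lines of Python. This is only a rough model and not SmaCC itself: the `scan` function, the token names, and the tie-breaking rule (earlier entries win when lengths are equal) are assumptions made for illustration.

```python
import re

def scan(text, token_specs):
    """Split text into (name, lexeme) pairs, always taking the longest match.

    token_specs is an ordered list of (name, regex). On equal length the
    earlier entry wins (an assumption of this sketch, not a SmaCC guarantee).
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in token_specs]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Only a strictly longer, non-empty match replaces the current best.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# First version of the grammar: noise is ".*"
greedy_noise = [("foo", r"foo"), ("noise", r".*")]
# Second version of the grammar: noise is a single "."
single_char_noise = [("foo", r"foo"), ("noise", r".")]

first = scan("__foo", greedy_noise)
second = scan("__foo", single_char_noise)
print(first)   # one noise token that swallows the whole input
print(second)  # two one-character noise tokens, then the "foo" keyword
```

With the `.*` definition the parser never sees a "foo" token, which matches the failure reported in the original post; with the single-character definition the keyword survives because it is longer than any competing noise match at that position.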
In reply to this post by thomas.hawker
I suspect that you mean:
- Parser spec -
> Test: Noise "foo" Noise;
> Noise: <noise>;
- Scanner spec -
> noise: .*;

And iirc, as I think Tom indicates, the scanner above will chew up everything, leaving nothing. In your second pair of specs, the scanner chews only one character at a time, leaving the parser something to look at.

- Dave
In reply to this post by thomas.hawker
Thank you, Tom, John and Wallen for your detailed explanation. This is the
first time I am using a real parser instead of ordinary regex matching. The example was in fact simplified. Actually, I want to parse some input that may contain parts of interest; everything else can be ignored. E.g.

"Some really (* this is important *) unimportant(* this too*) stuff."

The (* ... *) parts are well defined by some grammar, but I am unsure how to easily get rid of the garbage between them.

Cheers!
Steffen
Steffen,
I’m not sure what you’re parsing. Is it just a comment (looking for specific stuff within the special delimiters) or is it something more comprehensive? The reason I ask is that you don’t necessarily want to look for “noise* (‘(*’ content ‘*)’ noise*)*”. For example, if you’re scanning through a programming language, then you really need to look at lexemes; otherwise you might falsely recognize ‘(*’ inside a string.
Assuming you’re only scanning arbitrary text, not a program or piece of a program where you have to worry about lexemes, I think you want something like this:
Parser:
    Start: Text;
    Text: Noise | Text Delimit Noise;
    Noise: | Noise <noise>;
    Delimit: <begin> Content <end>;
    Content: ...

Scanner:
    begin: \(\*;
    end: \*\);
    noise: .
This will skip any noise until the first begin delimiter, process any content until the first following end delimiter, and repeat, skipping noise between delimited text as per your example. This makes sure that all text is scanned so that the parser thinks it has not stopped prematurely.
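To see the intended effect on Steffen's sample input, here is a small Python sketch of the same idea. It is not generated by SmaCC: `delimited_parts` is a hypothetical helper, and since `Content` is left open above, the sketch simply assumes the content is every character up to the next end delimiter.

```python
import re

# Tokens mirroring the scanner spec: begin "(*", end "*)", and one-character noise.
# The alternatives are tried in order, so the two-character delimiters win over noise.
TOKEN_RE = re.compile(r"(?P<begin>\(\*)|(?P<end>\*\))|(?P<noise>.)", re.DOTALL)

def delimited_parts(text):
    """Return the text found inside each (* ... *) pair, dropping everything else."""
    parts = []
    current = None  # None while we are outside any (* ... *) section
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if current is None:
            if kind == "begin":
                current = []
            # Any other token outside the delimiters is noise and is skipped.
        elif kind == "end":
            parts.append("".join(current))
            current = None
        else:
            # Assumption: everything inside the delimiters counts as content.
            current.append(m.group())
    if current is not None:
        raise ValueError("unterminated (* ... *) section")
    return parts

sample = "Some really (* this is important *) unimportant(* this too*) stuff."
parts = delimited_parts(sample)
print(parts)
```

A real `Content` rule would replace the "collect every character" branch with proper parsing of whatever the delimited sections contain.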
Cheers!
Tom Hawker -------------------------- Senior Framework Developer -------------------------- Home +1 (408) 274-4128 Office +1 (408) 576-6591 Mobile +1 (408) 835-3643
In reply to this post by Steffen Märcker
Hi,
I've played around a bit more and found something strange:

Scanner:
    <noise> : [a]* ;
Parser:
    Content: <noise> ;

This parses neither 'a' nor anything else. Changing the scanner to:

    <noise> : [a]+ ;

does work. Is there something obvious that I am not aware of? (The scanner does seem to recognize the token.)

Steffen
Steffen Märcker wrote:
> Hi,
>
> I've played around a bit more and found something strange:
> Scanner:
> <noise> : [a]* ;
> Parser:
> Content: <noise> ;
> Does neither parse 'a' nor anything else. Changing the scanner to:
> <noise> : [a]+ ;
> Does work.
> Is there something obvious, I am not aware of? (Because the scanner seems
> to recognize the token.)

In the first case, <noise> matches zero or more a's. Therefore, when you parse something like "aaa", the scanner sees two <noise> tokens. The first token is "aaa" and the second token is "". Your parser is only looking for one <noise> token so you get an error.

It is generally a bad idea to have tokens that can be empty. Instead you should write your grammar like:

<noise> : a+;
Content: <noise> | ;

This will accept zero or more a's.

John Brant
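The extra empty token is easy to reproduce with an ordinary regex engine. The snippet below uses Python's `re` module purely as an analogy for the scanner; the `tokens` helper is made up for this illustration, and SmaCC's scanner is a separate implementation.

```python
import re

def tokens(pattern, text):
    """Return every successive match of pattern in text, including empty ones."""
    return [m.group() for m in re.finditer(pattern, text)]

star = tokens(r"a*", "aaa")
plus = tokens(r"a+", "aaa")
empty_input = tokens(r"a+", "")

print(star)         # 'aaa' followed by an empty match at the end of the input
print(plus)         # just 'aaa'
print(empty_input)  # no tokens at all
```

With `a+` the empty input produces no token, which is why the grammar needs the empty alternative in `Content` to accept zero a's.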