[vwnc] SmaCC, matching arbitrary character sequence

Steffen Märcker

[vwnc] SmaCC, matching arbitrary character sequence

Hi,

I'd like to define a parser for the following grammar:

Test: Noise "foo" Noise;
Noise: <noise>;
noise: .*;

I'd expect that this successfully parses "__foo", but in fact it does not
parse anything successfully.
However, changing the definition to:

Test: Noise* "foo" Noise*;
Noise: <noise>;
noise: .;

does work. Can anybody explain what makes the difference between these two
versions, which I expected to be equivalent?

THX,
Steffen
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

thomas.hawker

Re: [vwnc] SmaCC, matching arbitrary character sequence

Steffen,

Sure. The difference is in the grammar expectations. In the first case,

Test: Noise "foo" Noise

the grammar says "Find one occurrence of Noise, followed by the literal 'foo', followed by another occurrence of Noise." Thus, the input string "__foo" doesn't match the definitions provided because there is nothing that corresponds to a Noise non-terminal. Even if you had a Noise before and after, I'm not sure it would parse because it wouldn't be able to handle the "__".

The second case:

Test: Noise* "foo" Noise*

states, "Find zero or more occurrences of Noise, followed by the literal 'foo', followed by zero or more occurrences of Noise." In this case, it should parse the input string "foo", but I'm not sure it would or should parse "__foo" (again, those "__").

I imagine your grammar is abbreviated, but the definition of "noise" would eat any and all '>' because '.' doesn't differentiate. Thus, the instant you add any '<', all remaining input would become part of "noise". Should it really be "[^>]*"?

Cheers!

Tom Hawker

IMPORTANT NOTICE
Email from OOCL is confidential and may be legally privileged. If it is not
intended for you, please delete it immediately unread. The internet
cannot guarantee that this communication is free of viruses, interception
or interference and anyone who communicates with us by email is taken
to accept the risks in doing so. Without limitation, OOCL and its affiliates
accept no liability whatsoever and howsoever arising in connection with
the use of this email. Under no circumstances shall this email constitute
a binding agreement to carry or for provision of carriage services by OOCL,
which is subject to the availability of carrier's equipment and vessels and
the terms and conditions of OOCL's standard bill of lading which is also
available at http://www.oocl.com.

John Brant-2

Re: [vwnc] SmaCC, matching arbitrary character sequence

In reply to this post by Steffen Märcker

The scanner is greedy and matches the longest string possible.
Therefore, in your first example, the scanner creates a single <noise>
token: "__foo". The parser fails since you don't have a "foo" token. In
the second example, your <noise> token is only a single character so the
scanner creates two <noise> "_" tokens followed by a "foo" token.
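
The longest-match behaviour described above can be sketched in Python. This is only a toy model, not SmaCC's actual scanner: the `scan` function and the rule lists `greedy` and `single` are hypothetical names made up for the illustration, and Python's `re` syntax stands in for SmaCC's regex syntax.

```python
import re

def scan(text, token_specs):
    """Tokenize text by always taking the longest match at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_lexeme = None, ""
        for name, pattern in token_specs:
            m = re.compile(pattern).match(text, pos)
            # Keep the longest non-empty match; on a tie the earlier rule wins.
            if m and len(m.group()) > len(best_lexeme):
                best_name, best_lexeme = name, m.group()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens

# First version of the grammar: noise is ".*"
greedy = [("foo", r"foo"), ("noise", r".*")]
# Second version of the grammar: noise is a single "."
single = [("foo", r"foo"), ("noise", r".")]

print(scan("__foo", greedy))  # [('noise', '__foo')]
print(scan("__foo", single))  # [('noise', '_'), ('noise', '_'), ('foo', 'foo')]
```

With the `.*` rule the whole input becomes one noise token, so a parser expecting a "foo" token never sees one. With the single-character rule, "foo" wins at position 2 because it is the longer match there.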

If you want to use the first example, you'll need to either update your
<noise> regex ".*" to not include the "foo" keyword or you'll need to
write a method on your scanner class to handle this.
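
As a rough illustration of the first option, here is a Python sketch of a noise pattern that stops before the keyword. It relies on a negative lookahead from Python's `re` module; whether SmaCC's scanner syntax offers anything equivalent is not stated in this thread, so treat the pattern as an assumption. The names `NOISE`, `TOKEN` and `scan` are hypothetical.

```python
import re

# Noise: one or more characters, none of which begins the keyword "foo".
NOISE = r"(?:(?!foo).)+"
TOKEN = re.compile(rf"(?P<foo>foo)|(?P<noise>{NOISE})")

def scan(text):
    """Split text into ('noise', ...) and ('foo', 'foo') tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # "." does not match a newline, so a newline ends up here.
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("__foo"))    # [('noise', '__'), ('foo', 'foo')]
print(scan("__foo__"))  # [('noise', '__'), ('foo', 'foo'), ('noise', '__')]
```

Because the noise token can no longer swallow "foo", a rule such as `Test: Noise "foo" Noise` would receive the token sequence it expects for an input like "__foo__".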

John Brant

Wallen, David

Re: [vwnc] SmaCC, matching arbitrary character sequence

In reply to this post by thomas.hawker

I suspect that you mean:

- Parser spec -
> Test: Noise "foo" Noise;
> Noise: <noise>;
- Scanner spec -
> noise: .*;

And iirc, as I think Tom indicates in his reply, the first scanner will chew up everything, leaving nothing. In your second pair of specs, the scanner chews only one character at a time, leaving the parser something to look at.

- Dave

Steffen Märcker

Re: [vwnc] SmaCC, matching arbitrary character sequence

In reply to this post by thomas.hawker

Thank you, Tom, John and Wallen, for your detailed explanations. This is the
first time I am using a real parser instead of ordinary regex matching.
The example was in fact simplified. What I actually want is to parse input
that may contain parts of interest, where everything else can be
ignored. E.g.

"Some really (* this is important *) unimportant(* this too*) stuff."

The (* ... *) parts are well defined by some grammar, but I am unsure how
to easily get rid of the garbage between them.

Cheers!
Steffen

thomas.hawker

Re: [vwnc] SmaCC, matching arbitrary character sequence

Steffen,

I’m not sure what you’re parsing. Is it just a comment (looking for specific stuff within the special delimiters) or is it something more comprehensive? The reason I ask is that you don’t necessarily want to look for “noise* (‘(*’ content ‘*)’ noise*)*”. For example, if you’re scanning through a programming language, then you really need to look at lexemes; otherwise you might falsely recognize ‘(*’ inside a string.

Assuming you’re only scanning arbitrary text, not a program or piece of a program where you have to worry about lexemes, I think you want something like this:

Parser:

Start: Text;
Text: Noise | Text Delimit Noise;
Noise: | Noise <noise>;
Delimit: <begin> Content <end>;
Content: ...

Scanner:

begin: \(\*;
end: \*\);
noise: .;

This will skip any noise until the first begin delimiter, process any content until the first following end delimiter, and repeat, skipping noise between delimited text as per your example. This makes sure that all text is scanned so that the parser thinks it has not stopped prematurely.
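
The same idea can be sketched in Python to see what it does with the example input. This is an illustration under assumptions, not generated SmaCC code: `TOKEN` and `delimited_parts` are hypothetical names, and because the Content rule is elided above, the sketch simply collects the raw text between the delimiters instead of parsing it.

```python
import re

# Three token kinds, mirroring the scanner above: begin, end, single-character noise.
TOKEN = re.compile(r"(?P<begin>\(\*)|(?P<end>\*\))|(?P<noise>.)", re.DOTALL)

def delimited_parts(text):
    """Return the text between each (* and the next *), skipping everything else."""
    parts = []
    current = None  # None while outside delimiters, else a list of collected pieces
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if current is None:
            if kind == "begin":
                current = []
            # Simplification: noise and stray end delimiters outside a block are ignored.
        elif kind == "end":
            parts.append("".join(current))
            current = None
        else:
            # Inside a block: stand-in for the elided Content rule, keep the raw text.
            current.append(m.group())
    if current is not None:
        raise ValueError("unterminated (* block")
    return parts

text = "Some really (* this is important *) unimportant(* this too*) stuff."
print(delimited_parts(text))  # [' this is important ', ' this too']
```

Every character of the input is turned into some token, so nothing is left unscanned; the noise tokens are just thrown away outside the delimited parts.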

Cheers!

Tom Hawker

Steffen Märcker

Re: [vwnc] SmaCC, matching arbitrary character sequence

In reply to this post by Steffen Märcker

Hi,

I've played around a bit more and found something strange:
Scanner:
<noise> : [a]* ;
Parser:
Content: <noise> ;
This parses neither 'a' nor anything else. Changing the scanner to:
<noise> : [a]+ ;
does work.
Is there something obvious that I am not aware of? (The scanner does seem
to recognize the token.)

Steffen

John Brant-2

Re: [vwnc] SmaCC, matching arbitrary character sequence

In the first case, <noise> matches zero or more a's. Therefore, when you
parse something like "aaa", the scanner sees two <noise> tokens. The
first token is "aaa" and the second token is "". Your parser is only
looking for one <noise> token so you get an error.

It is generally a bad idea to have tokens that can be empty. Instead you
should write your grammar like:

<noise> : a+;
Content: <noise> | ;

This will accept zero or more a's.
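
The effect of an empty-matching token can be modelled with a small Python sketch. It is a toy model of the behaviour described above, not SmaCC's real scanner loop: the function names are hypothetical, and the two `parses_as_*` helpers just count tokens to mimic the two versions of the Content rule.

```python
import re

def scan(text, pattern):
    """Toy scanner: at each position, including end of input, emit a <noise>
    token whenever the pattern matches there."""
    regex = re.compile(pattern)
    tokens = []
    pos = 0
    while True:
        m = regex.match(text, pos)
        if m is None:
            break
        tokens.append(m.group())
        if m.end() == pos:
            # An empty match consumes nothing; stop rather than loop forever.
            break
        pos = m.end()
    if pos != len(text):
        raise ValueError(f"no token matches at position {pos}")
    return tokens

def parses_as_single_noise(tokens):
    # Models "Content: <noise> ;" (exactly one token)
    return len(tokens) == 1

def parses_as_optional_noise(tokens):
    # Models "Content: <noise> | ;" (zero or one token)
    return len(tokens) <= 1

print(scan("aaa", r"a*"))  # ['aaa', '']
print(scan("aaa", r"a+"))  # ['aaa']
print(scan("", r"a+"))     # []
```

In this model the `a*` rule yields a trailing empty token, so a rule expecting exactly one `<noise>` fails, while `a+` combined with an optional `<noise>` in the parser covers both the empty input and a run of a's.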

John Brant