[vwnc] SmaCC, matching arbitrary character sequence

Steffen Märcker
Hi,

I'd like to define a parser for the following grammar:

Test: Noise "foo" Noise;
Noise: <noise>;
noise: .*;

I'd expect that this successfully parses "__foo", but in fact it does not  
parse anything successfully.
However, changing the definition to:

Test: Noise* "foo" Noise*;
Noise: <noise>;
noise: .;

does work. Can anybody explain what makes the difference between these two
versions, which I expected to be equivalent?

THX,
Steffen
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] SmaCC, matching arbitrary character sequence

thomas.hawker
Steffen,

Sure.  The difference is in the grammar expectations.  In the first case,

        Test: Noise "foo" Noise

the grammar says "Find one occurrence of Noise, followed by the literal 'foo', followed by another occurrence of Noise."  Thus, the input string of "__foo" doesn't match the definitions provided because there is nothing that corresponds to a Noise non-terminal.  Even if you had a Noise before and after, I'm not sure it would parse because it wouldn't be able to handle the "__".

The second case:

        Test: Noise* "foo" Noise*

states, "Find zero or more occurrences of Noise, followed by the literal 'foo', followed by zero or more occurrences of Noise."  In this case, it should parse the input string "foo", but I'm not sure it would or should parse "__foo" (again, those "__").

I imagine your grammar is abbreviated, but the definition of "noise" would eat any and all '>' because '.' doesn't differentiate.  Thus, the instant you add any '<', all remaining input would become part of "noise".  Should it really be "[^>]*"?

Cheers!
 
Tom Hawker

Re: [vwnc] SmaCC, matching arbitrary character sequence

John Brant-2
In reply to this post by Steffen Märcker
The scanner is greedy and matches the longest string possible.
Therefore, in your first example, the scanner creates a single <noise>
token: "__foo". The parser fails since you don't have a "foo" token. In
the second example, your <noise> token is only a single character so the
scanner creates two <noise> "_" tokens followed by a "foo" token.

If you want to use the first example, you'll need to either update your
<noise> regex ".*" to not include the "foo" keyword or you'll need to
write a method on your scanner class to handle this.
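The longest-match behaviour described above can be illustrated with a small Python sketch. This is not SmaCC's scanner; the `scan` function, the rule lists, and the rule that empty matches are ignored are assumptions made only for illustration.

```python
import re

def scan(rules, text):
    """Tokenize text by always taking the longest match at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer, non-empty match replaces the current best.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# Hypothetical rule sets mirroring the two scanner definitions in the question.
greedy_rules = [('"foo"', r"foo"), ("<noise>", r".*")]
single_rules = [('"foo"', r"foo"), ("<noise>", r".")]

print(scan(greedy_rules, "__foo"))  # [('<noise>', '__foo')]
print(scan(single_rules, "__foo"))  # [('<noise>', '_'), ('<noise>', '_'), ('"foo"', 'foo')]
```

With `.*` the whole input becomes one <noise> token, so the parser never sees a "foo" token; with `.` the keyword is the longest match at the third character and is emitted on its own.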


John Brant

Re: [vwnc] SmaCC, matching arbitrary character sequence

Wallen, David
In reply to this post by thomas.hawker
I suspect that you mean:

- Parser spec -
  > Test: Noise "foo" Noise;
  > Noise: <noise>;
- Scanner spec -
  > noise: .*;

And iirc, as I think Tom indicates, the scanner above will chew up everything, leaving nothing for the parser. In your second pair of specs, the scanner chews only one character at a time, leaving the parser something to look at.

- Dave

Re: [vwnc] SmaCC, matching arbitrary character sequence

Steffen Märcker
In reply to this post by thomas.hawker
Thank you, Tom, John and Wallen for your detailed explanations. This is the
first time I am using a real parser instead of ordinary regex matching.
The example was in fact simplified. Actually I want to parse some input
that possibly contains parts of interest; everything else can be
ignored. E.g.

"Some really (* this is important *) unimportant(* this too*) stuff."

The (* ... *) parts are well defined by some grammar, but I am unsure how
to easily get rid of the garbage between them.

Cheers!
Steffen


Re: [vwnc] SmaCC, matching arbitrary character sequence

thomas.hawker

Steffen,

I’m not sure what you’re parsing.  Is it just a comment (looking for specific stuff within the special delimiters) or is it something more comprehensive?  The reason I ask is that you don’t necessarily want to look for “noise* (‘(*’ content ‘*)’ noise*)*”.  For example, if you’re scanning through a programming language, then you really need to look at lexemes; otherwise you might falsely recognize ‘(*’ inside a string.

Assuming you’re only scanning arbitrary text, not a program or piece of a program where you have to worry about lexemes, I think you want something like this:

      Parser:
            Start:      Text;
            Text:       Noise | Text Delimit Noise;
            Noise:      | Noise <noise>;
            Delimit:    <begin> Content <end>;
            Content:    ...

      Scanner:
            <begin>:    \(\*;
            <end>:      \*\);
            <noise>:    .;

This will skip any noise until the first begin delimiter, process any content until the first following end delimiter, and repeat, skipping noise between delimited text as per your example.  This makes sure that all text is scanned so that the parser thinks it has not stopped prematurely.
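As a rough illustration of the skip-and-collect idea, here is a Python sketch rather than SmaCC code. It assumes Content is just the raw text between the delimiters (the actual Content rule is not given in this thread), and `TOKEN_RE` and `extract_delimited` are hypothetical names.

```python
import re

# The two-character delimiters are listed first so they take precedence over
# the one-character noise rule (Python alternation is ordered; here that gives
# the same result a longest-match scanner would).
TOKEN_RE = re.compile(r"(?P<begin>\(\*)|(?P<end>\*\))|(?P<noise>.)", re.DOTALL)

def extract_delimited(text):
    """Return the text between each (* and *) pair, skipping everything else."""
    contents = []
    current = None  # None while skipping noise, a list while inside (* ... *)
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "begin" and current is None:
            current = []
        elif kind == "end" and current is not None:
            contents.append("".join(current))
            current = None
        elif current is not None:
            # Inside a delimited section: keep the token as content.
            current.append(m.group())
        # Otherwise the token is noise outside any section and is dropped.
    if current is not None:
        raise ValueError("unterminated (* ... *) section")
    return contents

sample = "Some really (* this is important *) unimportant(* this too*) stuff."
print(extract_delimited(sample))  # [' this is important ', ' this too']
```

In a real grammar the collected text would instead be handed to whatever rules define Content.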

 

Cheers!

Tom Hawker

Re: [vwnc] SmaCC, matching arbitrary character sequence

Steffen Märcker
In reply to this post by Steffen Märcker
Hi,

I've played around a bit more and found something strange:
Scanner:
<noise> : [a]* ;
Parser:
Content: <noise> ;
This parses neither 'a' nor anything else. Changing the scanner to:
<noise> : [a]+ ;
does work.
Is there something obvious that I am missing? (The scanner does seem
to recognize the token.)

Steffen

Re: [vwnc] SmaCC, matching arbitrary character sequence

John Brant-2
In the first case, <noise> matches zero or more a's. Therefore, when you
parse something like "aaa", the scanner sees two <noise> tokens. The
first token is "aaa" and the second token is "". Your parser is only
looking for one <noise> token so you get an error.

It is generally a bad idea to have tokens that can be empty. Instead you
should write your grammar like:

<noise> : a+;
Content: <noise> | ;

This will accept zero or more a's.
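The extra empty token can be reproduced by analogy with Python's `re` module. This is only an analogy and does not show how SmaCC's scanner is implemented; the `tokens` helper is a hypothetical name.

```python
import re

def tokens(pattern, text):
    """Collect successive matches of one token pattern, including empty ones."""
    return [m.group() for m in re.finditer(pattern, text)]

# A pattern that may match nothing yields a trailing empty match at end of input.
print(tokens(r"a*", "aaa"))  # ['aaa', '']
# A pattern that needs at least one character yields exactly one token.
print(tokens(r"a+", "aaa"))  # ['aaa']
```

Making the token non-empty and moving the "zero occurrences" case into the parser rule, as in the grammar above, avoids the stray empty token.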


John Brant