Re: Second expression in sequence confusion FreeLink <- "[[" .{&[>\]]} "]]"


Squeak - Dev mailing list
Hi Levente,

Quick status update.

I have built through Links https://en.wikipedia.org/wiki/Help:Wikitext#Links_and_URLs
and Images https://en.wikipedia.org/wiki/Help:Visual_file_markup

And, I am able to hit the db for Wikimedia markup, parse it and serve it via Seaside at amazing speed.

I am now on Tables: https://en.wikipedia.org/wiki/Help:Table

And some of the cruft I have on the first iterations of the Grammar is starting to bite me.

So! I am rebuilding the grammar from first things using the grammarPEG as the baseline and building up.

Page <- (Break / Paragraph)*

Paragraph <- .{1,"\n"}
Break <- "\n"{2}   /* https://en.wikipedia.org/wiki/Help:Wikitext#Line_breaks */

/* Rules from grammarPEG */
s <- S*                     /* s is zero or more whitespace */
S <- whitespace+   /* S is one or more whitespace */

/* Primals from grammarPEG */
whitespace <- [\s\t\n\r]
SLASH <- "/"
BACKSLASH <- "\\"
AND <- "&"
NOT <- "!"
COMMA <- ","
QUESTION <- "?"
STAR <- "*"
PLUS <- "+"
DASH <- "-"
DOT <- "."
QUOTE <- "'"
DOUBLE_QUOTE <- '"'
OPEN_BRACKET <- "["
CLOSE_BRACKET <- "]"
OPEN_PAREN <- "("
CLOSE_PAREN <- ")"
OPEN_BRACE <- "{"
CLOSE_BRACE <- "}"

'

Thanks to your help, I am able to reason through this rather than just hack at it.

Given this input:

testParagraph
^
'This is a test of paragraphs.
One hard return.
',


WikitextParserLibrary loremIpsum,


'


two hard returns above',


'

I get the following (correct!!!!) output.


<body><p>This is a test of paragraphs.</p><p>One hard return.</p><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p><br/><p>two hard returns above</p><br/><br/><p>three hard returns above</p></body>


Thank you again for your help.

One quick "style" question if you have a moment. Currently Paragraph looks like this:


Paragraph: anOrderedCollection

<action: 'Paragraph'>
|text|
Transcript show:'Paragraph';cr.
text := ''.
anOrderedCollection do:[:each | text := text,each asString].

^self newElementTag: Paragraph elements: (Array with: (self newText: text))

It's the iteration that feels odd to me. It works, but it feels odd.

Thoughts?

cheers,

tty

---- On Wed, 11 Sep 2019 11:38:58 -0400 Levente Uzonyi <[hidden email]> wrote ----

Hi Tim,

On Wed, 11 Sep 2019, gettimothy wrote:

> Hi Levente.
>
> Thank you very much. This is a huge time saver.
>
> I was literally reading the wrong documentation.
>
> whew!
>
> Where is this documented? I do not see it on the wiki link you sent me. 

As far as I know, there's no documentation for it.

> Is it inferable from a class in the XTreams packages?

Yes, it is, because PEGParser is written in itself.
The grammar to write PEG grammars is in PEGParser class >> #grammarPEG.
Here's the part related to cardinality parsing:

Cardinality         <- OPEN_BRACE s (CardinalityRange / CardinalityLoopMin / CardinalityRangeMin / CardinalityLoop) s CLOSE_BRACE
CardinalityRangeMin <- NumLiteral
CardinalityRange    <- NumLiteral s COMMA s NumLiteral
CardinalityLoopMin  <- NumLiteral s COMMA s Expression
CardinalityLoop     <- Expression

The first line defines 4 potential cardinality descriptions. All of these
start with { (OPEN_BRACE) and end with } (CLOSE_BRACE). There can be any
number of spaces between the braces (s nonterminal).

The first one, CardinalityRangeMin accepts a single NumLiteral, which is
either a number with no leading zeroes, or the string "Infinity". Infinity
doesn't make any sense here (and the parser doesn't handle it btw).
When PEGParserParser >> #CardinalityRangeMin: processes this rule, it'll
create a block that will send #repeat:min:max: to PEGParser with the
parsed number as min and max too. So, the actual logic is in PEGParser >> #repeat:min:max:.
An example for this rule is
    Foo <- "x" {3}
which accepts Foo when there are three consecutive x characters on the
input stream.

The second one, CardinalityRange is similar to CardinalityRangeMin, but
accepts an upper bound as well:
    Foo <- "x"{3, 5}
accepts Foo when there are at least three consecutive x characters
on the input stream, but it'll consume up to 5 when there are that many.
Here the Infinity value makes sense:
    Foo <- "x"{3, Infinity}
accepts Foo when there are at least three consecutive x characters
on the input stream, but will consume all further x characters no matter
how many are there.

The third one, CardinalityLoopMin is the one you asked about in your other
mail. Instead of an upper bound, it takes a stop expression, which can be
anything from simple to advanced. The rule
    Foo <- [a-z]{3, "foo" / "bar"}
takes at least 3 lowercase ascii letters, then it will take further such
characters up until "foo" or "bar" appears on the stream. It will read
those characters as well, but will not yield them. By yield, I mean that
your parser will not receive the characters of "foo" or "bar", so when
you write your method processing the Foo rule, you will not know whether
the input ended with "foo", "bar" or a non-ascii-letter character.

The fourth one is similar to the third one, but it works as if 0
were added as the minimum number of repetitions of the pattern.
E.g.:
    Foo <- [a-z]{"foo" / "bar"}
is equivalent to
    Foo <- [a-z]{0, "foo" / "bar"}

>
> I looked over the XTreams test classes yesterday and did not see any clues.
> Do the test cases need improving? (I can contribute by doing the grunt work under supervision)

Yes, some things have no tests. For example, these cardinality rules. I'll
push some fixes to the repository related to them soon.

>
> Also, what does "don't yield them" mean?

I tried to explain above. I'll give you a more complete example if that
doesn't make it clear. Just let me know.

>
> Does it mean the parser stays in its present spot?

No. It means that the generated parser consumes the characters from the
input, but doesn't pass them to the rule processor methods.
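
For instance (an untested sketch; the rule name here is made up): with

    Letters <- [a-z]{3, "!"}

and the input abcd!, the parser consumes all five characters from the
stream, but the method processing the Letters rule only receives the
characters a, b, c and d; the terminating ! never shows up in the
collection passed to it.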

>
> thank you again.
>
> t.
>
> p.s. I have cc'ed squeak-beginners list on this message.

I don't think that's the appropriate place for these messages, because
this is anything but beginner stuff.
I think squeak-dev would be a much better place for now. Should our
discussion cause too much noise there, we can create a squeak-users list
for them in the future.


Levente

>
>
>
>
> ---- On Tue, 10 Sep 2019 20:30:07 -0400 Levente Uzonyi <[hidden email]> wrote ----
>
> Hi Tim,
>
> On Tue, 10 Sep 2019, gettimothy wrote:
>
> > Hi Levente.
> >
> > If you don't have time for this, "No" is  a good answer.
> >
> > I have the WikiMedia freelinks working. https://en.wikipedia.org/wiki/Help:Wikitext#Free_links
> >
> > [[This Is A Link]] generates 
> >
> > <a href="https://en.wikipedia.org/wiki/This_is_a_link">This is a link</a>
> >
> > I would like to translate the FreeLink <- "[[" .{&[>\]]} "]]" sequence into something like
> >
> > FreeLink <- LinkOpen  .{&[>\]]}  LinkClose
> > LinkOpen   <- BracketOpen BracketOpen
> > LinkClose   <- BracketClose BracketClose
> > BracketOpen <- "["
> > BracketClose <- "]"
> >
> > so that I can iteratively build up to more complicated link styles.
> >
> > That Capture in the middle of the sequence is giving me fits.
> > Something as simple as:
> >
> > FreeLink <- LinkOpen  .{&[>\]]}  LinkClose
> >
> > does not parse as neither LinkOpen nor LinkClose are consumed.
> >
> > My interpretation of that middle sequence term is:
> > "." get the next character and consume it.
> > {&[>\]]} apply expression &[>\]] and capture the string that matched it for later use.
>
> In Xtreams-Parsing, braces don't mean capture. They mean cardinality. It
> comes from common regular expression syntax.
> The regular expression x{1,3} means x 1 to 3 times, so it accepts x,
> xx, and xxx.
> You can also pass a single number x{3}, which is a shorthand for xxx.
> You can also omit the second argument like in x{3,}, which means x 3 or
> more times.
> This construct is extended in Xtreams-Parsing with a stop expression.
> "x"{"y"} means, accept any number of x up until y comes. Consume y too,
> but don't yield it. So, such expression accepts: xy, xxy, xxxy, xxxxy,
> etc, and yields x, xx, xxx, xxxx, etc.
>
> I suspect having & inside {} probably causes problems, because {} tries to
> consume what it parses, but & tells the parser not to consume what comes
> after it.
>
> If I were to write the FreeLink rule, it would be something like:
>
> FreeLink <- "[[" .{"]]"}
>
> It means: take two opening brackets, accept and yield everything up to two
> closing brackets, then consume those too, but don't yield them.
>
>
> Levente
>
> > &[>\]] AND predicate : indicate success if expression [>\]] matches text ahead; otherwise indicate failure. do not consume text.
> > [>\]]  character range between ">" and "]"
> >
> >
> > Thanks for your time.
> >
> > cordially,
> >
> > t
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>




Re: Second expression in sequence confusion FreeLink <- "[[" .{&[>\]]} "]]"

Levente Uzonyi
Hi Tim,

On Wed, 9 Oct 2019, gettimothy wrote:

> Hi Levente,
>
> Quick status update. 
>
> I have built through Links https://en.wikipedia.org/wiki/Help:Wikitext#Links_and_URLs
> and Images https://en.wikipedia.org/wiki/Help:Visual_file_markup
>
> And, I am able to hit the db for Wikimedia markup, parse it and serve it via Seaside at amazing speed.

That's great.

>
>
> I am now on Tables: https://en.wikipedia.org/wiki/Help:Table
>
> And some of the cruft I have on the first iterations of the Grammar is starting to bite me.
>
> So! I am rebuilding the grammar from first things using the grammarPEG as the baseline and building up.

I'm not sure that's a good idea. Those things may help, but they are very
specific to the PEG language used by the parser.

>
>       Page <- (Break / Paragraph)*
>
>
> Paragraph <-   .{1,"\n"}
> Break <- "\n"{2}   /* https://en.wikipedia.org/wiki/Help:Wikitext#Line_breaks */

Be careful with \n. For some reason (probably a bug, though it may be
intentional), PEGParserParser interprets \n as CR and \r as LF, which is
the exact opposite of what someone familiar with these escapes would expect.
This swap helps when you process strings written in Squeak, which have
CRs as line endings, but once you start to process content from outside,
which will have either LF or CRLF line endings, the parser will break.
You can work around this by creating an EndOfLine rule and using that
instead of "\n":

EndOfLine <- "\n\r" / "\n" / "\r" /* CRLF or CR or LF */
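
With that, your Break and Paragraph rules would become something like
(untested sketch, reusing the rule names from your grammar above):

Break <- EndOfLine{2}
Paragraph <- .{1, EndOfLine}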

>
> /* Rules from grammarPEG */
> s <- S*                     /* s is zero or more whitespace */
> S <- whitespace+   /* S is one or more whitespace */
>
> /* Primals from grammarPEG */
> whitespace <- [\s\t\n\r]
> SLASH <- "/"
> BACKSLASH <- "\\"
> AND <- "&"
> NOT <- "!"
> COMMA <- ","
> QUESTION <- "?"
> STAR <- "*"
> PLUS <- "+"
> DASH <- "-"
> DOT <- "."
> QUOTE <- "'"
> DOUBLE_QUOTE <- '"'
> OPEN_BRACKET <- "["
> CLOSE_BRACKET <- "]"
> OPEN_PAREN <- "("
> CLOSE_PAREN <- ")"
> OPEN_BRACE <- "{"
> CLOSE_BRACE <- "}"
>
> '
>
>
> Thanks to your help, I am able to reason through this rather than just hack at it.
>
> Given this input:
>
>       testParagraph
> ^
> 'This is a test of paragraphs.
> One hard return.
> ',
>
>
> WikitextParserLibrary loremIpsum,
>
>
> '
>
>
> two hard returns above',
>
>
> '
>
>
> I get the following (correct!!!!) output.
>
>
>       <body><p>This is a test of paragraphs.</p><p>One hard return.</p><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
>       veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
>       occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p><br/><p>two hard returns above</p><br/><br/><p>three hard returns above</p></body>
>
>
>
> Thank you again for your help.
>
> One quick "style" question if you have a moment. Currently Paragraph looks like this:
>
>
>       Paragraph: anOrderedCollection
>
> <action: 'Paragraph'>
> |text|
> Transcript show:'Paragraph';cr.
> text := ''.
> anOrderedCollection do:[:each | text := text,each asString].
>
> ^self newElementTag: Paragraph elements: (Array with: (self newText: text))
>
> It's the iteration that feels odd to me. It works, but it feels odd.
>
> Thoughts?

That concatenation inside the loop will end up being too slow (O(n^2)).
I'd write it as:

  text := anOrderedCollection joinSeparatedBy: ''.

It's still not optimal, but it's short, and the runtime is linear.
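
So the whole action method could shrink to something like this (untested
sketch, based on your method above, minus the Transcript logging):

Paragraph: anOrderedCollection

<action: 'Paragraph'>
| text |
"Join the parsed pieces into one string; this runs in linear time."
text := anOrderedCollection joinSeparatedBy: ''.
^self newElementTag: Paragraph elements: (Array with: (self newText: text))

An equivalent, explicit version of the concatenation built on a WriteStream
(also untested) would be:

"Same result without repeatedly concatenating growing strings."
text := String streamContents: [:stream |
    anOrderedCollection do: [:each | stream nextPutAll: each asString]].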


Levente

>
> cheers,
>
> tty
>