How to approach this PEGParser Grammar Fix.

tty

How to approach this PEGParser Grammar Fix.

Hi Folks,

In Xtreams parsing, the grammarWiki/PEGWikiGenerator combination does not parse the Wikimedia headings.

I have copied the grammarWiki to grammarWikiMedia and I am slowly building it up in an attempt to isolate the problem.

The grammar looks like this:

grammarWikiMedia
"
http://en.wikipedia.org/wiki/Help:Wiki_markup
"
^
'Page <- (Heading)*

LineCharacter <- [^\n]
Flow <- Escape / Bold / Italic / LinkShort / LinkFull / LineCharacter
Escape <- "**" / "__" / "[["
Bold <- "*" Flow{"*"}
Italic <- "_" Flow{"_"}
LinkShort <- "[" .{&[>\]]} "]"
LinkFull <- "[" Flow{">"} .{"]"}

Line <- Flow{1,"\n"}
Paragraph <- Line
Empty <- "\n"
Whitespace <- [\t\s]*

Heading <- Heading6 / Heading5 / Heading4 / Heading3 / Heading2 / Heading1
Heading1 <- Whitespace "= " Flow{" ="}
Heading2 <- Whitespace "== " Flow{" =="}
Heading3 <- Whitespace "=== " Flow{" ==="}
Heading4 <- Whitespace "==== " Flow{" ===="}
Heading5 <- Whitespace "===== " Flow{" ====="}
Heading6 <- Whitespace "====== " Flow{" ======"}

'

For the Actor, I have copied the PEGWikiGenerator, saving it as PEGWikiMediaGenerator. I have made some minor additions to support H5 and H6 heading levels per Wikimedia standards.

My problem is that Wikimedia seems to like to wrap its <hN></hN> tags within a paragraph: <p><hN></hN></p>

So, while I can parse this input just ducky:

| wikiGrammar wikiParser input output |
wikiGrammar := PEGParser grammarWikiMedia reading positioning. "This is your grammar converted to an xtream."
wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar actor: PEGParserParser new. "This is the parser generated from your grammar."
input := ' = Heading 1 = == Heading 2 == === Heading 3 === ==== Heading 4 ==== ===== Heading 5 ===== ====== Heading 6 ======'.
output := wikiParser parse: 'Page' stream: input actor: PEGWikiMediaGenerator new. "An actual compiler doing the most basic stuff."
output inspect.

Producing an XMLElement looking like this:

<div><h1>Heading 1</h1><h2>Heading 2</h2><h3>Heading 3</h3><h4>Heading 4</h4><h5>Heading 5</h5><h6>Heading 6</h6></div>

When I wrap the <hn> elements in <p> tags for this input...

| wikiGrammar wikiParser input output |
wikiGrammar := PEGParser grammarWikiMedia reading positioning. "This is your grammar converted to an xtream."
wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar actor: PEGParserParser new. "This is the parser generated from your grammar."
input := '<p>= Heading 1 =</p> <p>== Heading 2 ==</p> <p>=== Heading 3 ===</p> <p>==== Heading 4 ====</p> <p>===== Heading 5 =====</p> <p> ====== Heading 6 ======</p>'.
output := wikiParser parse: 'Page' stream: input actor: PEGWikiMediaGenerator new. "An actual compiler doing the most basic stuff."
output inspect.

my XMLElement looks like this:

<div/>

I am supposing that I have a wayward Grammar specification.

Where should I focus?

Should I hack at

Heading1 <- Whitespace "= " Flow{" ="}

and change "Whitespace" to something else ?

Or should I redefine the

Line <- Flow{1,"\n"}
Paragraph <- Line

duo?

If a general principle exists that will guide me going forward, I would very much appreciate it.

Thank you in advance.

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners

tty

Re: How to approach this PEGParser Grammar Fix.

Figured out how to approach it.
Study PEG (Parsing Expression Grammar):
https://en.wikipedia.org/wiki/Parsing_expression_grammar
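
A rough Python model (not Xtreams code) of why the <p>-wrapped input produced an empty <div/>, assuming standard PEG semantics: a starred rule stops at the first failure and still succeeds, so Page <- (Heading)* matches zero headings when the input opens with "<p>". The function names are hypothetical, and treating \s as a plain space is an assumption.

```python
def parse_heading(text, pos, level):
    """Model of HeadingN: Whitespace, N '=' and a space, title, space and N '='."""
    # Whitespace <- [\t\s]* always succeeds, possibly consuming nothing.
    # Assumption: \s in the grammar means a plain space.
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    opener, closer = "=" * level + " ", " " + "=" * level
    if not text.startswith(opener, pos):
        return None  # the rule fails without consuming input
    start = pos + len(opener)
    end = text.find(closer, start)
    if end == -1 or "\n" in text[start:end]:
        return None  # LineCharacter <- [^\n], so a title cannot span lines
    return end + len(closer), (level, text[start:end])


def parse_page(text):
    """Model of Page <- (Heading)* with Heading tried from level 6 down to 1."""
    pos, headings = 0, []
    while True:
        for level in (6, 5, 4, 3, 2, 1):  # ordered choice: first match wins
            result = parse_heading(text, pos, level)
            if result is not None:
                break
        else:
            # No alternative matched. The star ends here and Page still
            # succeeds with whatever was collected, possibly nothing.
            return headings, pos
        pos, heading = result
        headings.append(heading)
```

In this model, parse_page applied to '<p>= Heading 1 =</p>' returns an empty list and position 0: the first character is "<", every Heading alternative fails, and the star quietly matches zero times. If the real parser behaves the same way, the grammar needs a rule that consumes the "<p>" before Heading can be tried.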

--
Sent from: http://forum.world.st/Squeak-Beginners-f107673.html

tty

Re: How to approach this PEGParser Grammar Fix.

In reply to this post by tty

Slowly grokking the big picture.

A grammar is a bunch of PEG rules.

In grammarWiki (and now grammarWikiMedia), the "top rule" is

Page <- (Preformatted / Code / UnorderedList / OrderedList / Heading / Table
/ Paragraph / Empty)*

which is zero or more repetitions of an ordered choice among the subrules.
The rule for Preformatted looks like this:

Preformatted <- "---\n" .{"---\n"}

Now here is the neat thing.

In PEGWikiGenerator, the PEGActor subclass for that grammar, there are a
bunch of methods, and many of those methods carry pragmas.

Here is the Preformatted method:

Preformatted: text

<action: 'Preformatted' arguments: #(2)>
<action: 'Code' arguments: #(2)>

^self
newElementTag: Preformatted
elements: (Array with: (self newText: text))

See those pragmas?
They match up exactly with the Grammar rules.

Something parses the grammar and then goes looking in the Actor for
pragmas. When a pragma in a method matches a rule, I think a block is
stored in the PEGParser, and when that rule is matched during the parse,
that block is executed.
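
Here is a small Python model of that guess (not the Xtreams implementation). A decorator stands in for the pragma, a lookup table maps rule names to methods, and the arguments list is assumed to be 1-based positions into the parts matched by the rule. All names, and the "<pre>" output, are hypothetical.

```python
def action(rule, arguments):
    """Hypothetical stand-in for an <action: rule arguments: #(...)> pragma."""
    def mark(method):
        # Record the annotation on the function itself, like method metadata.
        method.pragmas = getattr(method, "pragmas", []) + [(rule, tuple(arguments))]
        return method
    return mark


class WikiGenerator:
    # One method can serve several rules, as in the Preformatted example.
    @action("Preformatted", arguments=(2,))
    @action("Code", arguments=(2,))
    def preformatted(self, text):
        return "<pre>" + text + "</pre>"  # illustrative output only


def collect_actions(actor):
    """Scan the actor's methods and build a table: rule name -> (method, indexes)."""
    table = {}
    for name in dir(type(actor)):
        method = getattr(actor, name)
        for rule, indexes in getattr(method, "pragmas", ()):
            table[rule] = (method, indexes)
    return table


def run_action(table, rule, matched_parts):
    """Call the method registered for a rule with the selected matched parts."""
    method, indexes = table[rule]
    # Assumption: indexes are 1-based positions within the rule's sequence.
    return method(*(matched_parts[i - 1] for i in indexes))
```

Under that assumption, #(2) for Preformatted would pick the second part of "---\n" .{"---\n"}, which is the body text, and skip the opening delimiter.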


tty

Re: How to approach this PEGParser Grammar Fix.

In reply to this post by tty

So, to answer my initial questions...

The way to debug this is from the bottom up.

With the rule

Page <- (Paragraph)*

and a breakpoint in the PEGActor's Paragraph method, the debugger
shows that the rule for Paragraph

Line <- Flow{1,"\n"}
Paragraph <- Line

while wrong (it wraps the entire page in a paragraph), IS being captured by
the Actor.

Interestingly, Paragraph is executed before Page. This tells me that
it is an inverse onion approach.

Get the inner rules correct and work outwards.

So, I have a broken Grammar for WikiText and now I have a method to approach
this problem.

1. try to define a rule.
2. put a breakpoint in the method that has a pragma for that rule.
3. ???
4. profit
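
A tiny Python model (again, not Xtreams code) of why the inner action fires first: the outer rule's action needs the results of its children, so the children have to run before it. The function names and the line-splitting are hypothetical simplifications.

```python
def paragraph(line, trace):
    """Action for the inner rule: runs as soon as one Paragraph has matched."""
    trace.append("Paragraph")
    return "<p>" + line + "</p>"


def page(text, trace):
    """Action for the outer rule: it can only run once its children are built."""
    children = [paragraph(line, trace) for line in text.split("\n") if line]
    trace.append("Page")
    return "<div>" + "".join(children) + "</div>"
```

Calling page on a two-line text with an empty trace list leaves the trace as Paragraph, Paragraph, Page, which is the same order the breakpoints showed. That is why getting the inner rules right first pays off: the outer rule only ever sees what they return.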
