Hi Folks, In Xtreams parsing, the grammarWiki/PEGWikiGenerator combo does not parse the Wikimedia headings. I have copied the grammarWiki to grammarWikiMedia and I am slowly building it up in an attempt to isolate the problem. Grammar looks like this:
For the Actor, I have copied the PEGWikiGenerator, saving it as PEGWikiMediaGenerator. I have made some minor additions to support H5 and H6 heading levels per Wikimedia standards. My problem is that Wikimedia seems to like to wrap its <hN></hN> tags within a paragraph: <p><hN></hN></p>. So, while I can parse this input just ducky:
Producing an XMLElement looking like this:
When I wrap the <hN> elements in <p> tags for this input...
my XMLElement looks like this:
I am supposing that I have a wayward Grammar specification. Where should I focus? Should I hack at
and change "Whitespace" to something else ? Or should I redefine the
duo? If a general principle exists that will guide me going forward, I would very much appreciate it. Thank you in advance.
t
_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
Figured out how to approach it.
Study PEG (Parsing Expression Grammar): https://en.wikipedia.org/wiki/Parsing_expression_grammar
-- Sent from: http://forum.world.st/Squeak-Beginners-f107673.html
In reply to this post by tty
Slowly grokking the big picture.
A Grammar is a bunch of PEG rules. In grammarWiki (and now grammarWikiMedia), the "top rule" is

    Page <- (Preformatted / Code / UnorderedList / OrderedList / Heading / Table / Paragraph / Empty)*

which is a grouping of zero or more subrules. The rule for Preformatted looks like this:

    Preformatted <- "---\n" .{"---\n"}

Now here is the neat thing. In PEGWikiGenerator, the PEGActor subclass for that grammar, there are a bunch of methods, and many of those methods contain PRAGMAS. Here is the Preformatted method:

    Preformatted: text
        <action: 'Preformatted' arguments: #(2)>
        <action: 'Code' arguments: #(2)>
        ^self newElementTag: Preformatted elements: (Array with: (self newText: text))

See those pragmas? They match up exactly with the Grammar rules. Somehow, something parses the grammar and then goes looking in the Actor for pragmas. When a pragma in a method matches a rule, then I "think" that a block is stored in the PEGParser, and when that rule is encountered during the parse, that block is executed.
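One way to see the rule-to-method pairing for yourself is to list the pragmas from a workspace. This is only a sketch: it assumes a Squeak image with Xtreams loaded, and that your image provides Pragma class>>allNamed:in: (Squeak's pragma support has it, but check your version). It only finds pragmas with the two-keyword form shown above, and only in methods defined directly in PEGWikiGenerator.

```smalltalk
"List each <action:arguments:> pragma in PEGWikiGenerator:
 the grammar rule name on the left, the actor method selector on the right."
(Pragma allNamed: #action:arguments: in: PEGWikiGenerator) do: [:pragma |
	Transcript
		show: pragma arguments first printString;
		show: ' -> ';
		show: pragma method selector printString;
		cr]
```

Running the same expression against PEGWikiMediaGenerator should show whether the copied actor has a method registered for every rule name used in the new grammar.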
In reply to this post by tty
So, to answer my initial questions...
The way to debug this is from the bottom up. Reducing the top rule to

    Page <- (Paragraph)*

and putting a breakpoint in the PEGActor's Paragraph method shows that the rule for Paragraph

    Line <- Flow{1,"\n"}
    Paragraph <- Line

while wrong (it wraps the entire page in a paragraph), IS being captured by the Actor. Interestingly, Paragraph is executed before Page, which tells me that it's an inverse onion approach: get the inner rules correct and work outwards.

So, I have a broken Grammar for WikiText and now I have a method to approach this problem.

1. Try to define a rule.
2. Put a breakpoint in the method that has a pragma for that rule.
3. ???
4. Profit
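The steps above can be driven from a workspace roughly like this. Treat it as a sketch under assumptions: the parse:stream:actor: calls follow the pattern I remember from the Xtreams parsing documentation, grammarWikiMedia and PEGWikiMediaGenerator are the copies described earlier in this thread, and the input string is made up for illustration.

```smalltalk
"Assumes PEGParser class>>grammarWikiMedia and PEGWikiMediaGenerator exist
 as described earlier in the thread. The input text is hypothetical."
| parser result |

"Step 1: compile the grammar under test into a parser."
parser := PEGParser parserPEG
	parse: 'Grammar'
	stream: PEGParser grammarWikiMedia reading
	actor: PEGParserParser new.

"Step 2: parse a small input starting from the top rule. A breakpoint
 (or a temporary 'self halt') in the actor method for the rule under
 test will stop here when that rule matches."
result := parser
	parse: 'Page'
	stream: 'one line of plain text' reading
	actor: PEGWikiMediaGenerator new.

result inspect
```

Keeping the input tiny and the top rule reduced to the one subrule being tested makes it easier to tell which rule produced which element in the inspector.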