Hi Folks.
Over at http://menmachinesmaterials.com/WikitextParser ***

When hitting HamburgerIcon->Database->Random Page I occasionally get what I call "Unicode spew".

Here is a portion of a page:

    <�!DOCTYPE html><�html class="no-js" lang="en" dir="ltr"><�head><�title>WikitextParser<�/title><�meta charset="utf-8"/><�link rel="stylesheet" type="text/css" href="/files/WADevelopmentFiles/development.css"/>...

However, on the image, if I run the page manually, the resulting XMLElement looks just fine.

Here is the thing that caused the spew:

    <body><p> Thierry IV or Theoderic IV ({{circa}} 720{{spaced ndash}}c. 782)
    was a Frankish <https://www.wikipedia.org/wiki/Franks> noble. Count of
    Autun <https://www.wikipedia.org/wiki/Autun> and Toulouse
    <https://www.wikipedia.org/wiki/Toulouse> ; he was thought to be a son of
    Sigebert V <https://www.wikipedia.org/wiki/Sigebert_V> , and grandson of
    Sigebert IV of Raze <https://www.wikipedia.org/wiki/Sigebert_IV_of_Raze> .
    It is now well documented that his supposed Davidic blood was a hoax (see
    Priory of Sion <https://www.wikipedia.org/wiki/Priory_of_Sion> ). Thierry
    married Auda <https://www.wikipedia.org/wiki/Auda_of_France> , daughter of
    Charles Martel <https://www.wikipedia.org/wiki/Charles_Martel> , sister of
    Pepin III <https://www.wikipedia.org/wiki/Pepin_III> .</p>
    Children
    <ul><li><a href="https://www.wikipedia.org/wiki/William_of_Gellone">William of Gellone</a> (755 – 28 May 812/4)</li>
    <li>Alda of Gellone (born ca. 770); married Fredalon</li>
    <li><a href="https://www.wikipedia.org/wiki/Adalhelm_of_Autun">Adalhelm of Autun</a></li></ul>
    <p>{{Persondata <div/>| NAME = Thierry 04| ALTERNATIVE NAMES =| SHORT DESCRIPTION = Frankish noble| DATE OF BIRTH =| PLACE OF BIRTH =| DATE OF DEATH =| PLACE OF DEATH =}}{{DEFAULTSORT:Thierry 04}}
    Category:720s births <https://www.wikipedia.org/wiki/Category:720s_births>
    Category:780s deaths <https://www.wikipedia.org/wiki/Category:780s_deaths>
    Category:Counts of Autun <https://www.wikipedia.org/wiki/Category:Counts_of_Autun>
    Category:Counts of Toulouse <https://www.wikipedia.org/wiki/Category:Counts_of_Toulouse>
    Category:Frankish people <https://www.wikipedia.org/wiki/Category:Frankish_people>
    </p><p>{{France-noble-stub}}</p></body>

The method that posts the output is straightforward enough:

    renderParsedOn: html
        | wikiGrammar wikiParser input actor |

        actor := PEGWikiMediaGeneratorTables new.
        actor transcripton
            ifTrue: [Transcript clear].

        wikicode isNil
            ifTrue: [input := '== Welcome To WikitextParserBrowser ==']
            ifFalse: [input := wikicode].

        wikiGrammar := PEGParser grammarWikiMediaTables reading positioning.
        wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar actor: PEGParserParser new.
        [[output := wikiParser parse: 'Page' stream: input actor: actor]
            on: Error
            do: [:ex | output := ' Error parsing. see Wikicode tab for source ']]
            ensure: [
                output := ((output asString copyReplaceAll: '<body>' with: '')
                    copyReplaceTokens: '</body>' with: '').
                output := (output asString copyReplaceAll: '&gt;' with: '>' asTokens: false).
                output := (output asString copyReplaceAll: '&lt;' with: '<' asTokens: false)].
        html break; break.
        html html: output.

Is there something I should be doing to "output" to make the garbage go away?

thanks in advance

*** Alpha/Beta dev tool. If you get a DNU just hit the back button and try again. Please do not hit Debug (:

--
Sent from: http://forum.world.st/Seaside-General-f86180.html
_______________________________________________
seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside
That doesn’t look like an encoding problem. The only place these question marks appear is right after a <. Try looking at the source with a hex editor to identify the actual character that is placed after the <. My guess would be character 0 (NUL) or something similar.
Karsten
Georg Heeg eK
Wallstraße 22
06366 Köthen
Tel.: 03496/214328
FAX: 03496/214712
Amtsgericht Dortmund HRA 12812

(In reply to tty ([hidden email]), 13 February 2020, 21:27:18.)
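A hex editor is one way to follow this suggestion; the same check can also be run from a workspace inside the image. The sketch below is only an illustration: `suspects` and `sample` are made-up names, and it assumes the rendered page is available as a String (for example via `output asString`). It collects the code point of every character that directly follows a `<` and is not something a tag would normally start with.

```smalltalk
"Hypothetical diagnostic, not part of Seaside or the PEG parser."
| suspects sample |
suspects := [:aString | | found next |
    found := OrderedCollection new.
    1 to: aString size - 1 do: [:i |
        (aString at: i) = $< ifTrue: [
            next := aString at: i + 1.
            "Letters, $/, $! and $? are what normally follows $< in markup."
            (next isLetter or: ['/!?' includes: next])
                ifFalse: [found add: next asInteger]]].
    found].

"Made-up sample: a NUL character slipped in right after the first $<."
sample := String streamContents: [:s |
    s nextPut: $<; nextPut: (Character value: 0); nextPutAll: 'p></p>'].

suspects value: sample.   "-> an OrderedCollection(0)"
```

A result of 0 would support the "character 0" guess; any other value names the code point to search for.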
There is something weird in the code you shared: it has regular tags such as <body><p> alongside things within angle brackets (<, >) that are URLs. Following that, you have &lt; and &gt; entities that seem to belong to tags.

I suggest you run the input through tidy [1] before doing any HTML parsing, and avoid "string replacements" such as `copyReplaceAll: '&lt;' with: '<'`.

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo

On Fri, Feb 14, 2020 at 11:01 AM Karsten Kusche <[hidden email]> wrote:
> That doesn’t look like an encoding problem. The only places where you have these question marks is right behind a <. Try to look at the source with a hex-editor to identify the actual character that’s placed behind <. My guess would be character 0 or something similar.
>
> Karsten
In reply to this post by Karsten Kusche
Thank you, that makes sense.
"View Source" in Chrome shows those little empty squares... Funny, I can use the XTerm browser "lynx" and it displays just fine. thanks for the clue.
In reply to this post by Esteban A. Maringolo
Hi Esteban,
Those escaped tags are an attribute of the XMLElements. XML escapes all "tags" that are not part of the schema. So, to get those escaped tags to be interpreted as xHTML in the browser, I have to change them back:

Wikitext (<sometag>) -> XMLElement (&lt;sometag&gt;) -> Me (<sometag>) -> Browser.
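The "change them back" step in that pipeline can be sketched as a small block. This is an illustration only: `unescape` is a made-up name, and it relies on nothing beyond `copyReplaceAll:with:`. The one design point is ordering: `&amp;` is decoded last, so text that was escaped twice loses only one level of escaping.

```smalltalk
"Hypothetical helper: turn escaped markup back into live tags."
| unescape |
unescape := [:aString |
    ((aString copyReplaceAll: '&lt;' with: '<')
        copyReplaceAll: '&gt;' with: '>')
        copyReplaceAll: '&amp;' with: '&'].   "last, so '&amp;lt;' becomes '&lt;', not '<'"

unescape value: '&lt;sometag&gt;'.   "-> '<sometag>'"
unescape value: '&amp;lt;'.          "-> '&lt;'"
```

Note that anything in the wikitext that merely looks like a tag becomes live markup after this step, which is one reason to be cautious with plain string replacement.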
In reply to this post by tty
Delved a bit deeper.
At this point:

    output := wikiParser parse: 'Page' stream: input actor: actor.

output is an XMLElement. My (lousy, btw) copyAndReplaceAll contortions convert it to a WideString. I just discovered I can omit all that and just use

    html html: output.

directly on the XMLElement. So, the problem is either directly within the XMLElement or in the "transition" between it and html render: output.
In reply to this post by tty
Found it!
https://en.wikipedia.org/w/index.php?title=Thierry_IV&action=edit

This line:

    [[William of Gellone]] (755 – 28 May 812/4)

The "–" separating 755 and 28 blows up the entire page. Internally, within the XMLElement, it is not rendered at all: "(755 28 May 812/4)".

My hunch is that this bunch of characters is a nice contribution from Microsoft to help blow up the WWW.

Any advice? Thanks for your time.
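If the en dash (U+2013, decimal 8211) really is what trips the page, one possible workaround is to rewrite every non-ASCII character as a numeric character reference before the text reaches the renderer. The sketch below is an assumption-laden illustration, not a confirmed fix: `escapeNonAscii` and `sample` are made-up names, and whether this cures the spew on the live site is untested.

```smalltalk
"Hypothetical helper, not part of Seaside or the PEG parser."
| escapeNonAscii sample |
escapeNonAscii := [:aString |
    String streamContents: [:out |
        aString do: [:ch |
            ch asInteger > 127
                ifTrue: [out nextPutAll: '&#'; print: ch asInteger; nextPut: $;]
                ifFalse: [out nextPut: ch]]]].

"The line from the Thierry IV article, with the en dash built from its code point."
sample := '(755 ', (String with: (Character value: 8211)), ' 28 May 812/4)'.

escapeNonAscii value: sample.   "-> '(755 &#8211; 28 May 812/4)'"
```

The result is pure ASCII, so it no longer depends on how the wide character is encoded on its way to the browser.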