How to think about "Unicode spew"


tty

Feb 13, 2020; 8:27pm

How to think about "Unicode spew"

780 posts

Hi Folks.

Over at http://menmachinesmaterials.com/WikitextParser ***

When hitting HamburgerIcon->Database->Random Page I occasionally get what
I call "Unicode spew".

Here is a portion of a page.
*<�!DOCTYPE html><�html class="no-js" lang="en"
dir="ltr"><�head><�title>WikitextParser<�/title><�meta
charset="utf-8"/><�link rel="stylesheet" type="text/css"
href="/files/WADevelopmentFiles/development.css"/>...*

However, on the image, if I run the page manually, the resulting XMLElement
looks just fine.

Here is the thing that caused the spew.

*<body> Thierry IV or Theoderic IV ({{circa}} 720{{spaced ndash}}c. 782)
was a Frankish <https://www.wikipedia.org/wiki/Franks> noble. Count of
Autun <https://www.wikipedia.org/wiki/Autun> and Toulouse
<https://www.wikipedia.org/wiki/Toulouse> ; he was thought to be a son of
Sigebert V <https://www.wikipedia.org/wiki/Sigebert_V> , and grandson of
Sigebert IV of Raze <https://www.wikipedia.org/wiki/Sigebert_IV_of_Raze> .
It is now well documented that his supposed Davidic blood was a hoax (see
Priory of Sion <https://www.wikipedia.org/wiki/Priory_of_Sion> ). Thierry
married Auda <https://www.wikipedia.org/wiki/Auda_of_France> , daughter of
Charles Martel <https://www.wikipedia.org/wiki/Charles_Martel> , sister of
Pepin III <https://www.wikipedia.org/wiki/Pepin_III> .
Children
<ul><li><a
href="https://www.wikipedia.org/wiki/William_of_Gellone">William of
Gellone</a> (755 – 28 May 812/4)</li><li>Alda of Gellone (born ca.
770); married Fredalon</li><li><a
href="https://www.wikipedia.org/wiki/Adalhelm_of_Autun">Adalhelm of
Autun</a></li></ul>{{Persondata <div/>| NAME = Thierry
04| ALTERNATIVE NAMES =| SHORT DESCRIPTION = Frankish noble| DATE OF BIRTH
=| PLACE OF BIRTH =| DATE OF DEATH =| PLACE OF DEATH
=}}{{DEFAULTSORT:Thierry 04}} Category:720s births
<https://www.wikipedia.org/wiki/Category:720s_births> Category:780s deaths
<https://www.wikipedia.org/wiki/Category:780s_deaths> Category:Counts of
Autun <https://www.wikipedia.org/wiki/Category:Counts_of_Autun>
Category:Counts of Toulouse
<https://www.wikipedia.org/wiki/Category:Counts_of_Toulouse>
Category:Frankish people
<https://www.wikipedia.org/wiki/Category:Frankish_people>
{{France-noble-stub}}</body>*

The method that posts the output is straightforward enough:

*renderParsedOn: html
    | wikiGrammar wikiParser input actor |

    actor := PEGWikiMediaGeneratorTables new.
    actor transcripton
        ifTrue: [Transcript clear].

    wikicode isNil
        ifTrue: [input := '== Welcome To WikitextParserBrowser ==']
        ifFalse: [input := wikicode].

    wikiGrammar := PEGParser grammarWikiMediaTables reading positioning.
    wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar
        actor: PEGParserParser new.
    [[output := wikiParser parse: 'Page' stream: input actor: actor]
        on: Error
        do: [:ex | output := '
Error parsing. see Wikicode tab for source
']]
        ensure: [
            "Strip the wrapper element, then turn escaped brackets back into markup."
            output := ((output asString copyReplaceAll: '<body>' with: '')
                copyReplaceTokens: '</body>' with: '').
            output := (output asString copyReplaceAll: '&gt;' with: '>'
                asTokens: false).
            output := (output asString copyReplaceAll: '&lt;' with: '<'
                asTokens: false)].
    html break; break.
    html html: output.

*

Is there something I should be doing to "output" to make the garbage go
away?

thanks in advance
*** Alpha/Beta dev tool. If you get a DNU just hit the back button and try
again. Please do not hit Debug (:

--
Sent from: http://forum.world.st/Seaside-General-f86180.html
_______________________________________________
seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside

Karsten Kusche

Feb 14, 2020; 2:01pm

Re: How to think about "Unicode spew"

269 posts

That doesn’t look like an encoding problem. The only place where these question marks appear is right after a <. Try looking at the source with a hex editor to identify the actual character that follows each <. My guess would be character 0 or something similar.
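
The same check can be approximated inside the image, without a hex editor. The following is a minimal workspace sketch, assuming Squeak or Pharo; `sample` is a made-up string standing in for whatever is about to be handed to Seaside, and the NUL character in it is only an illustration of the guess above, not a confirmed finding.

```smalltalk
"Workspace sketch: record the position and code point of every character
 outside printable ASCII (this also flags tabs and line breaks).
 'sample' is a hypothetical stand-in for the rendered output string."
| sample suspects |
sample := '<' , (String with: (Character value: 0)) , 'html>'.
suspects := OrderedCollection new.
sample withIndexDo: [:ch :index |
    (ch asInteger < 32 or: [ch asInteger > 126])
        ifTrue: [suspects add: index -> ch asInteger]].
"Print one line per suspicious character on the Transcript."
suspects do: [:each |
    Transcript
        show: 'index ' , each key printString ,
            ' code point 16r' , (each value printStringBase: 16);
        cr].
```

Running this against the real output string, instead of `sample`, would show which code points actually sit behind each `<`.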

Karsten

Georg Heeg eK
Wallstraße 22
06366 Köthen

Tel.: 03496/214328
FAX: 03496/214712
Amtsgericht Dortmund HRA 12812


Esteban A. Maringolo

Feb 14, 2020; 2:14pm

Re: How to think about "Unicode spew"

2343 posts

There is something weird in the code you shared: it has regular tags such as
<body> alongside things within angle brackets (<, >) that are URLs.
Following that, you have &lt; and &gt; entities that seem to
belong to tags.

I suggest you run the input through tidy [1] before doing any HTML
parsing, and avoid "string replacements" such as `copyReplaceAll:
'&lt;' with: '<'`.

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo


tty

Feb 17, 2020; 9:27pm

Re: How to think about "Unicode spew"

780 posts

In reply to this post by Karsten Kusche

Thank you, that makes sense.

"View Source" in Chrome shows those little empty squares...

Funny, I can use the XTerm browser "lynx" and it displays just fine.

thanks for the clue.


tty

Feb 17, 2020; 9:31pm

Re: How to think about "Unicode spew"

780 posts

In reply to this post by Esteban A. Maringolo

Hi Esteban,

Those escaped tags are an attribute of the XMLElements. XML escapes all
"tags" that are not part of the schema. So, to get those escaped tags to be
interpreted as XHTML in the browser, I have to change them back.

Wikitext (<sometag>) -> XMLElement (&lt;sometag&gt;) -> Me (<sometag>) -> Browser.


tty

Feb 17, 2020; 11:16pm

Re: How to think about "Unicode spew"

780 posts

In reply to this post by tty

Delved a bit deeper.

At this point: *output := wikiParser parse: 'Page' stream: input actor:
actor. *

output is an XMLElement.

My (lousy, btw) copyReplaceAll: contortions convert it to a WideString.
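
Whether a given string has become wide can be checked directly in a workspace. This is a minimal sketch, assuming Squeak or Pharo; both sample strings are made up for illustration, and the second carries an en dash (U+2013) like the one in the sample page text.

```smalltalk
"Workspace sketch: a string stays narrow until it holds a character
 above code point 255. Both samples are hypothetical."
| plain wide plainIsWide wideIsWide |
plain := '(755 - 28 May 812/4)'.
wide := '(755 ' , (String with: (Character value: 16r2013)) , ' 28 May 812/4)'.
plainIsWide := plain isWideString.
wideIsWide := wide isWideString.
Transcript
    show: 'plain is wide: ' , plainIsWide printString; cr;
    show: 'wide is wide: ' , wideIsWide printString; cr.
```

Sending `isWideString` to `output asString` for a page that renders correctly and for one that does not would show whether the spew correlates with wide strings.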

I just discovered I can omit all that and just use a

* html html: output.* directly on the XMLElement.

So, the problem is either directly within the XMLElement or in the
"transition" between it and html render: output.


tty

Feb 17, 2020; 11:25pm

Re: How to think about "Unicode spew"

780 posts

In reply to this post by tty

Found it!

https://en.wikipedia.org/w/index.php?title=Thierry_IV&action=edit

This line:

*[[William of Gellone]] (755 – 28 May 812/4)

The "–" separating 755 and 28 blows up the entire page.

Internally, within the XMLElement, it is not rendered at all.

"(755 28 May 812/4)"

My hunch is that bunch of characters is a nice contribution from Microsoft
to help blow up the WWW.

Any advice?

thanks for your time.
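
One workaround sometimes used in this situation, not confirmed anywhere in this thread, is to replace every non-ASCII character with a numeric character reference before the markup is handed on, so that the string stays plain ASCII. The following is a minimal workspace sketch, assuming Squeak or Pharo; `markup` is a made-up stand-in for the parser output as a string.

```smalltalk
"Workspace sketch: rewrite every character above code point 127 as a
 numeric character reference. 'markup' is a hypothetical sample; the
 character built from 16r2013 is the en dash from the Wikipedia line."
| markup asciiOnly |
markup := '(755 ' , (String with: (Character value: 16r2013)) , ' 28 May 812/4)'.
asciiOnly := String streamContents: [:stream |
    markup do: [:ch |
        ch asInteger > 127
            ifTrue: [stream nextPutAll: '&#'; print: ch asInteger; nextPut: $;]
            ifFalse: [stream nextPut: ch]]].
Transcript show: asciiOnly; cr.
```

For the sample line the en dash becomes `&#8211;`, which a browser displays as the dash again. Whether this removes the spew depends on where in the output path the wide character causes trouble, which the thread leaves open.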
