[9.1] AbtXMLDOMParser suddenly doesn't like Umlauts any more

jtuchel
I am in the middle of migrating our application to VAST 9.1. One of the biggest reasons for our migration is that we now finally have a working GRVASTUtf8Codec to use in Seaside, which lets us get rid of a lot of problems with special characters.

Unfortunately, this migration is one of the hardest I have ever encountered. Not because of VAST, but because of UTF-8 and its pitfalls.

One thing that is broken in our shiny new VAST 9.1 image is a simple RSS reader that loads a few blog posts from our web site and displays the latest three of them in our Web App.

While the code is totally fine in 8.6.3, it suddenly breaks in 9.1.


Here is the code we used in 8.6.3:

    | stream xml parser dom result |

    parser := AbtXmlDOMParser newNonValidatingParser.
   
    stream := CfsReadFileStream open: 'feed.xml'.

    stream isCfsError ifTrue: [^nil].
    [xml := stream contents. ]
        ensure: [stream close].

    xml := xml convertFromCodePage: 'UTF-8'.
    xml := ReadStream on: xml.
    dom := parser parse: xml


In 9.1, however, it breaks with an SGMLSyntaxError. I found out that this is because the XML stream gets "cut off" at an 'ö' (German umlaut o), so a closing tag cannot be found. If you want to test it, use this curl command to get the file:

curl -L https://kontolino.de/feed > feed.xml

I found a way to "fix" this. I just had to add

    parser decodingEnabled: false.


before parsing. So I could be happy.


But there are a few questions in my head I'd like to ask:

  • My code seems a bit clumsy, and the AbtXMLInputSource / AbtXMLFile classes and friends look as if they were designed for parsing and decoding in one step. I just can't find out how to set them up properly. Any hints or examples?
  • I wonder why this breaks in VAST 9.1. What has changed? Did I miss something in the migration guide?
Thanks

Joachim

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/va-smalltalk.
For more options, visit https://groups.google.com/d/optout.
Re: [9.1] AbtXMLDOMParser suddenly doesn't like Umlauts any more

Mariano Martinez Peck-2
Hi Joachim,

I created a support case for it (63997) and we can continue there. I will share the results back on the forum in case we find something useful.

Best, 

On Tue, Sep 4, 2018 at 11:48 AM Joachim Tuchel <[hidden email]> wrote:

--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.
[hidden email]

Re: [9.1] AbtXMLDOMParser suddenly doesn't like Umlauts any more

jtuchel
Just for the record: This issue has been solved with Mariano's help.

It was of course my fault: I somehow managed to convert "to UTF-8" twice, and the second conversion broke things. Going back to not converting by hand and not touching decodingEnabled: made everything work like a charm.
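
For anyone landing here later, this is a minimal sketch of what the working version presumably looks like. It is the snippet from the first post with the manual convertFromCodePage: step removed and decodingEnabled: left at its default. This is an assumption based on the description above, not the exact code from the support case, and it uses only the calls that already appear in the original snippet.

    | stream xml parser dom |

    parser := AbtXmlDOMParser newNonValidatingParser.

    stream := CfsReadFileStream open: 'feed.xml'.
    stream isCfsError ifTrue: [^nil].
    [xml := stream contents]
        ensure: [stream close].

    "No manual convertFromCodePage: call and no decodingEnabled: override,
     so (assumption) the parser is the only place where decoding happens."
    dom := parser parse: (ReadStream on: xml).

The point is that the bytes read from feed.xml should be decoded exactly once. Decoding by hand and then letting the parser decode again is what appears to have cut the stream off at the 'ö'.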

VAST 9.1 is really nice. Now that I have sorted out all my misconceptions about UTF-8 conversions, plus some more or less unrelated things I never dreamed of (like Mac filenames using a different UTF-8 character sequence than most other OSes, which is only slightly related to UTF-8), I am happily serving web pages from Seaside that have nice special characters, even when they are served as a response to Ajax calls ...

Keep up the good work!
 
