When reading an XML input stream, the XML framework provides a mechanism to convert the input stream from a specific code page to the current code page (of the image).
When working with RSS feeds I noticed that these streams are very often UTF-8 encoded, so the input stream is UTF-8 encoded. Several remarks about this situation:

* You can set only ONE global user encoder (please consider 'Smalltalk-multitasking' systems). This is a restriction.
* Converting UTF-8 to the current code page can, in general, produce unpredictable results ...
* Another idea is to tunnel the UTF-8 stream WITHOUT conversion, but this is not possible without writing your own encoder (this is documented in the online documentation). I think this option should be available (at least for NON-ASCII countries).

WHEN TUNNELING the data (in UTF-8) you get into trouble when the input stream contains characters encoded as multiple bytes (like the German umlauts) AND also contains an XML character reference with a code point value > 256. Because of the large code point value the system creates an instance of DBString (with UTF-16 encoding), but the rest of the characters are left alone (and are therefore still UTF-8 encoded). You get an instance of DBString with multiple encoding schemes ...

The only solution seems to be: do NOT tunnel, and hope that the UTF-8 to current code page conversion mentioned above does NOT produce a wrong result AND that your Linux system is correctly configured (in terms of code pages).

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group. To view this discussion on the web visit https://groups.google.com/d/msg/va-smalltalk/-/kaeVD0738lkJ. To post to this group, send email to [hidden email]. To unsubscribe from this group, send email to [hidden email]. For more options, visit this group at http://groups.google.com/group/va-smalltalk?hl=en.
I forgot to mention: if you do it the way the framework actually works (conversion to the current code page), you still get (under the circumstances I mentioned) instances of DBString with mixed encodings, but now with Unicode code points and current code page code points.
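To make the two mixed-encoding cases concrete, here is a small model of them. It is written in Python rather than Smalltalk so it can be run anywhere, and it does not use the VA Smalltalk XML framework. The helper `expand_char_refs` is hypothetical, and code page 850 is only an assumption, chosen because its values for the umlauts differ visibly from Unicode. Each string is modelled as a sequence of code values, one per character.

```python
import re

def expand_char_refs(text):
    """Replace numeric XML character references (&#NNN; or &#xHHH;)
    with the character for that Unicode code point (hypothetical helper)."""
    def repl(match):
        body = match.group(1)
        code_point = int(body[1:], 16) if body[0] in "xX" else int(body)
        return chr(code_point)
    return re.sub(r"&#([xX][0-9A-Fa-f]+|[0-9]+);", repl, text)

# UTF-8 bytes of a feed fragment: German umlauts plus a reference to U+20AC.
raw = "Grüße &#8364;".encode("utf-8")

# Case 1, tunneling: every UTF-8 byte becomes one character unchanged,
# then the character reference is expanded to a Unicode code point.
tunneled = expand_char_refs(raw.decode("latin-1"))

# Case 2, conversion: the text is converted to the current code page
# (assumed here: cp850), each code page byte becomes one character,
# then the character reference is expanded to a Unicode code point.
code_page_bytes = raw.decode("utf-8").encode("cp850")
converted = expand_char_refs(code_page_bytes.decode("latin-1"))

# Both results mix two schemes inside one string.
print([hex(ord(c)) for c in tunneled])   # UTF-8 bytes next to 0x20ac
print([hex(ord(c)) for c in converted])  # cp850 values next to 0x20ac
```

In the first list the umlauts appear as the byte pairs 0xc3 0xbc and 0xc3 0x9f next to the single value 0x20ac. In the second they appear as the code page values 0x81 and 0xe1, again next to 0x20ac.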
The only solution is: check EVERY string you get from the XML framework and correct possible problems. But if you do that check anyway, you may as well consider the UTF-8 non-conversion (tunneling) mentioned above, just to get rid of the current code page, which brings additional complexity.
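A check of this kind could look roughly like the following sketch, again in Python and not in VA Smalltalk. It assumes the tunneling variant, so every character with a value below 256 is taken to be a raw UTF-8 byte, and every larger value is taken to be a Unicode code point that came from a character reference. The function name `repair_tunneled` is made up for this example. The sketch cannot tell a tunneled byte from a character reference whose value is between 128 and 255, so such input would need extra handling.

```python
def repair_tunneled(text):
    """Decode runs of tunneled UTF-8 bytes and keep code points >= 256
    as they are (hypothetical helper, assumes tunneled input)."""
    out = []
    pending = bytearray()  # bytes collected since the last wide character

    def flush():
        # Decode the collected bytes as UTF-8; raises UnicodeDecodeError
        # if they do not form valid UTF-8.
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    for ch in text:
        if ord(ch) < 256:
            pending.append(ord(ch))
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)

# A string as the tunneling case would deliver it: UTF-8 bytes for the
# umlauts, followed by the already expanded code point U+20AC.
mixed = "Gr\u00c3\u00bc\u00c3\u009fe \u20ac"
print(repair_tunneled(mixed))
```

Raising an error on invalid byte runs is a deliberate choice in this sketch: a string that fails to decode is one that needs a manual look anyway.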
In reply to this post by Marten Feldtmann-2
One of these candidates - not working with Unicode characters - is AbtXmMappedText. Its valueStream is initialized via
<pre> valueStream := WriteStream on: (Locale current preferredStringClass new). </pre> and preferredStringClass is here (in Germany) String, which is not able to work with multi-byte characters. Therefore all code using "Locale current preferredStringClass new" is a critical point for Unicode.