Hello,
I have a question about what all these classes do (in the big picture), how they work together, and when they are actually called. I looked into the source code, but I still have problems fully understanding it.

When I have an adapter with GRNullCodec, I assume that all (?) traffic/content (?) goes through the GRNullCodec, but because GRNullCodec does nothing, the traffic/content is not changed.

What exactly goes through these codecs?

If I use an adapter with GRPharoUtf8Codec, is the content then converted to/from UTF8?

What does this mean for strings (in my application) that I render on my pages, as in the following command:

html text: stringInSomeCodePage

in both cases, GRNullCodec and GRPharoUtf8Codec, and with texts in specific code pages like Utf8, Latin1 and "true" Unicode (Utf32)?

In my first demos I held all my strings in UTF8 and used the GRNullCodec (and everything was OK on the browser side).

Then I changed to GRPharoUtf8Codec and it seems to me that I now got an additional UTF8 conversion. Then I switched my application strings back to Latin1 and it was OK again.

How does this all work with Unicode characters with code points > 255 (and usage of GRPharoUtf8Codec) (in Squeak: WideString)?

When is a GRPharoUtf8Codec really needed?

Perhaps this is a stupid question ... but then I would like to know it :-))

Thanks for answering!

Marten

_______________________________________________
seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside
> When I have an adapter with GRNullCodec, I assume that all (?) traffic,
> content (?) goes through the GRNullCodec, but because GRNullCodec does
> nothing, the traffic/content is not changed.

Yes.

> If I use an adapter with GRPharoUtf8Codec, is the content then converted
> to/from UTF8?

This assumes that the outside world uses UTF8. That means incoming textual data is converted from UTF8 to Unicode, and outgoing textual data from Unicode to UTF8.

> What does this mean for strings (in my application) that I render on my
> pages, as in the following command:
>
> html text: stringInSomeCodePage

If you typed 'stringInSomeCodePage' inside the image, you need to use GRPharoUtf8Codec, otherwise it might show up as garbage. If all your strings come from outside the image (or you only have ASCII strings inside the image) and you do not depend on the meaning of strings inside the image (that is, you treat them as a binary sequence of bytes only), you can also use GRNullCodec.

> In my first demos I held all my strings in UTF8 and used the GRNullCodec
> (and everything was OK on the browser side).

In the ASCII range Unicode and UTF8 are the same, so it does not matter what you use.

> Then I changed to GRPharoUtf8Codec and it seems to me that I now got an
> additional UTF8 conversion. Then I switched my application strings back to
> Latin1 and it was OK again.
>
> How does this all work with Unicode characters with code points > 255 (and
> usage of GRPharoUtf8Codec) (in Squeak: WideString)?
>
> When is a GRPharoUtf8Codec really needed?

You should not switch when you have actual String data; you will just get a mess of differently encoded Strings. Instead of the Utf8Codec you can also use the Latin1 codec; that means outgoing data will be Latin1 and incoming data will be converted from Latin1. In practice I don't see a reason why you would want anything other than UTF8.

> Perhaps this is a stupid question ... but then I would like to know it :-))

Philippe is the expert here. Check out his ESUG presentation:
http://www.slideshare.net/esug/esug-unicode

Lukas

--
Lukas Renggli
www.lukas-renggli.ch
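The "additional UTF8 conversion" from the question can be reproduced in a workspace. This is a minimal sketch, assuming Grease is loaded and that #encode: answers a string whose characters stand for the encoded bytes (as described later in this thread); the variable names are illustrative only.

```smalltalk
| codec native encodedOnce encodedTwice |
codec := GRCodec forEncoding: 'utf-8'.

"A string typed inside the image is a native string: one character per code point."
native := 'ä'.

"What the adaptor's codec does to outgoing text: native -> utf-8."
encodedOnce := codec encode: native.

"If the application string was ALREADY utf-8 encoded, the codec cannot know that.
 It treats every encoded byte as a character and encodes it a second time."
encodedTwice := codec encode: encodedOnce.

"Decoding once (as the browser does) removes only one layer, so the result is
 the once-encoded string, not the original native one."
(codec decode: encodedTwice) = encodedOnce.
```

This is why holding UTF8-encoded strings in the image and then switching the adaptor to a UTF8 codec produces garbage: the strings have to be native before they reach the codec.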
In reply to this post by marten
On 13.10.2011 at 18:12, Marten Feldtmann wrote:

> I have a question about what all these classes do (in the big picture),
> how they work together, and when they are actually called.
> [...]
> in both cases, GRNullCodec and GRPharoUtf8Codec, and with texts in
> specific code pages like Utf8, Latin1 and "true" Unicode (Utf32)?

UTF8TextConverter is Pharo-specific. The GR...Codec... classes are Grease classes, which you have to load separately. If you use

(GRCodec forEncoding: 'utf-8') decode:/encode:

then you get the platform-specific encoding class for the platform you are on. That is the cross-dialect, cross-platform way of doing it.

Finally, when it comes to encoding, you have to do it right at the border of a system, where data is exchanged. It only works if the encoding is negotiated and that negotiation is honored. In every other case, where things are even slightly different, it will fail one way or another.

And by the way, UTF32 is not "true" Unicode. Unicode is about mapping symbols and particles to numbers (code points). UTF-xx is the encoding of those numbers as bytes. The encodings are either space efficient (UTF-8) or performance efficient (UTF-16, UTF-32).

Norbert
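Converting exactly once at the border, using the cross-platform lookup mentioned above, could be sketched like this. It assumes Grease is loaded; `fromOutside` is a hypothetical stand-in for data read from a socket or file and is built with #encode: only to keep the snippet self-contained.

```smalltalk
| codec fromOutside insideImage toOutside |
"Ask Grease for the codec; the concrete class is chosen per platform."
codec := GRCodec forEncoding: 'utf-8'.

"Hypothetical stand-in for utf-8 data arriving from outside the image."
fromOutside := codec encode: 'Grüße'.

"Inbound border: decode once, then work only with native strings."
insideImage := codec decode: fromOutside.

"Outbound border: encode once, just before the data leaves the image."
toOutside := codec encode: insideImage , '!'.
```

Everything between the two borders only ever sees native strings, so ordinary string operations such as concatenation keep working on characters rather than on encoded bytes.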
In reply to this post by marten
2011/10/13 Marten Feldtmann <[hidden email]>:
> I have a question about what all these classes do (in the big picture),
> how they work together, and when they are actually called.
> [...]

So, …

A codec is an object that handles string encodings. It has two main responsibilities:

    encodedString - #decode: -> decodedString
    decodedString - #encode: -> encodedString [1]

for example

    utf-8 - #decode: -> "native" string
    "native" string - #encode: -> utf-8

"Native" means whatever the semantics of a string are in the current dialect. That's ByteString/WideString in Pharo, ByteString/TwoByteString/FourByteString in VW and so on. ä is 'ä', € is '€' and ☃ is '☃'.

A codec stream is a stream that encodes the elements you write on it and passes them to an underlying stream. So for example you can write "native" strings to a codec stream and it encodes them and passes them to an underlying stream. You can ask a codec for a stream.

GRPharoUtf8Codec and GRPharoUtf8CodecStream are the Pharo-specific implementation classes for utf-8; they contain a fast path that only works in Pharo. They are part of Seaside.

UTF8TextConverter is part of Pharo and implements utf-8 decoding and encoding. We use it for cases that are not covered by the fast path.

GRNullCodec implements an identity transformation: #encode: and #decode: always answer the exact same string you passed them. It's there for historic reasons and because some non-single-byte string classes have (had) quite severe bugs and performance regressions.

So how does all this fit together? The codec on the server adapter #decode:s the request and #encode:s the response [2]. So when you set the codec on the server adapter to utf-8, the following is supposed to happen:

    request (utf-8) - #decode: -> "native" string
    response ("native" string) - #encode: -> utf-8

That means the strings you #render: or #text: have to be "native". They must not be in any encoding other than "Smalltalk". It also means the encoding the application reports to the browser has to be utf-8 (this happens automatically unless you override it).

So what if you set the codec to a NullCodec? Well, nothing happens.

    request (whatever encoding) - #decode: -> whatever encoding
    response (whatever encoding) - #encode: -> whatever encoding

You get strings where each character is a byte as sent by the browser, and you're supposed to deliver strings where each character is a byte in the same encoding. So in the case of utf-8 you would get:

    request (utf-8) - #decode: -> utf-8
    response (utf-8) - #encode: -> utf-8

So instead of 'ä' you would get

    (String with: (Character value: 195) with: (Character value: 164))

(an ä encoded as utf-8). The same is true for the strings you #render: and #text:; they have to be utf-8 encoded already as well. It also means the encoding the application reports to the browser has to be utf-8.

I hope this makes what's supposed to happen a bit more clear.

[1] Yes, it's actually wrong that we have encoded strings; we should have byte arrays instead.
[2] Actually it's a bit more involved. It also has an #url codec that is responsible for the url encoding when rendering URLs. URLs do not necessarily have the same encoding as the HTML page on which they are rendered (yes).

Cheers
Philippe
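Setting the codec on a server adapter, and the identity behaviour of GRNullCodec, might look like the following. This is a sketch under assumptions: WAComancheAdaptor is only an example adaptor class (use whichever one is loaded in your image), port 8080 is arbitrary, and the #port: and #codec: selectors are believed to be the usual WAServerAdaptor protocol but should be checked against your Seaside version.

```smalltalk
| adaptor nullCodec utf8Bytes |
"Assumption: WAComancheAdaptor is loaded; another WAServerAdaptor subclass
 should be configured the same way."
adaptor := WAComancheAdaptor port: 8080.

"Requests are #decode:d and responses #encode:d with this codec, so everything
 passed to #render: or #text: must be a native string."
adaptor codec: (GRCodec forEncoding: 'utf-8').

"With GRNullCodec both directions are the identity, so the application has to
 supply already-encoded strings itself, e.g. ä as the two utf-8 bytes 195 and 164."
nullCodec := GRNullCodec new.
utf8Bytes := String with: (Character value: 195) with: (Character value: 164).
(nullCodec encode: utf8Bytes) = utf8Bytes.
```

The adaptor is deliberately not started here; once it is configured it would typically be started with `adaptor start`.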