UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

marten

UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

Hello,

I have a question about what all these classes do (in the big picture),
how they work together, and when they are actually called. I looked into
the source code, but I still have trouble fully understanding it.

When I have an Adapter with GRNullCodec I assume, that all (?) traffic,
content (?) goes through the GRNullCodec, but due to the fact, that
GRNullCodec does nothing, the traffic/content is not changed.

What exactly goes through these codecs?

If I use an adapter with GRPharoUtf8Codec is then the content converted
to/from UTF8 ????

What does this mean to strings (in my application) I render on my pages
like in the following command:

html text: stringInSomeCodePage

in both cases GRNullCodec and GRPharoUtf8Codec and with texts in
specific code pages like Utf8, Latin1 and "true" Unicode (Utf32).

In my first demos I held all my strings in UTF8 and used the
GRNullCodec (and everything was ok on the browser side).

Then I changed to GRPharoUtf8Codec and it seems to me, that I got now an
additional UTF8 conversion. Then I switched my application strings back
to Latin1 and it was ok again.

How does this all work with Unicode characters with code points > 255
(and usage of GRPharoUtf8Codec) (in Squeak: WideString)?

When is a GRPharoUtf8Codec really needed ??

Perhaps this is a stupid question .... but then I would like to know it :-))

Thanks for answering !

Marten

_______________________________________________
seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside

Lukas Renggli

Re: UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

> When I have an Adapter with GRNullCodec I assume, that all (?) traffic,
> content (?) goes through the GRNullCodec, but due to the fact, that
> GRNullCodec does nothing, the traffic/content is not changed.

Yes.

> If I use an adapter with GRPharoUtf8Codec is then the content converted
> to/from UTF8 ????

This assumes that the outside world uses UTF8. That means incoming
textual data is converted from UTF8 to Unicode, and outgoing textual
data is converted from Unicode to UTF8.

> What does this mean to strings (in my application) I render on my pages like
> in the following command:
>
> html text: stringInSomeCodePage

If you typed 'stringInSomeCodePage' inside the image, you need to use
GRPharoUtf8Codec; otherwise it might show up as garbage.

If all your strings come from outside the image (or you only have
ASCII Strings inside the image) and you do not depend on the meaning
of strings inside the image (that is you treat them as a binary
sequence of bytes only), you can also use GRNullCodec.

> In my first demos I held all my strings in UTF8 and used the GRNullCodec
> (and everything was ok on the browser side).

In the ASCII range, Unicode and UTF8 are the same, so it does not
matter which one you use.

> Then I changed to GRPharoUtf8Codec and it seems to me, that I got now an
> additional UTF8 conversion. Then I switched my application strings back to
> Latin1 and it was ok again.
>
> How does this all work with Unicode characters with code points > 255 (and
> usage of GRPharoUtf8Codec) (in Squeak: WideString)?
>
> When is a GRPharoUtf8Codec really needed ??

You should not switch when you have actual String data; you will just
get a mess of differently encoded Strings. Instead of the Utf8Codec
you can also use the Latin1 codec. That means outgoing data will be
Latin1 and incoming data will be converted from Latin1.

In practice I don't see a reason why you would want anything else but UTF8.

> Perhaps this is a stupid question .... but then I would like to know it :-))

Philippe is the expert here. Check out his ESUG presentation:

http://www.slideshare.net/esug/esug-unicode

Lukas

--
Lukas Renggli
www.lukas-renggli.ch

NorbertHartl

Re: UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

In reply to this post by marten

On 13.10.2011 at 18:12, Marten Feldtmann wrote:

> Hello,
>
> I have a question about what all these classes do (in the big picture), how they work together, and when they are actually called. I looked into the source code, but I still have trouble fully understanding it.
>
> When I have an Adapter with GRNullCodec I assume, that all (?) traffic, content (?) goes through the GRNullCodec, but due to the fact, that GRNullCodec does nothing, the traffic/content is not changed.
>
> What exactly goes through these codecs?
>
> If I use an adapter with GRPharoUtf8Codec is then the content converted to/from UTF8 ????
>
> What does this mean to strings (in my application) I render on my pages like in the following command:
>
> html text: stringInSomeCodePage
>
> in both cases GRNullCodec and GRPharoUtf8Codec and with texts in specific code pages like Utf8, Latin1 and "true" Unicode (Utf32).
>
> In my first demos I held all my strings in UTF8 and used the GRNullCodec (and everything was ok on the browser side).
>
> Then I changed to GRPharoUtf8Codec and it seems to me, that I got now an additional UTF8 conversion. Then I switched my application strings back to Latin1 and it was ok again.
>
> How does this all work with Unicode characters with code points > 255 (and usage of GRPharoUtf8Codec) (in Squeak: WideString)?
>
> When is a GRPharoUtf8Codec really needed ??
>
> Perhaps this is a stupid question .... but then I would like to know it :-))
>

The rule of thumb is that a string you create inside the image is a collection of characters that answer their Unicode code point when asked for their asciiValue. If you get your string from outside the image, you are only safe if you negotiate the encoding with the outside world. In an HTTP scenario you should pick the character encoding from the HTTP headers; there is no way of knowing the encoding up front. In a web environment it is reasonably safe to assume that you get back the same encoding you sent to the client, because browsers honor it as far as I know.

UTF8TextConverter is Pharo-specific. The GR...Codec... classes are Grease classes, which you have to load separately. If you use

(GRCodec forEncoding: 'utf-8') decode:/encode:

then you get the platform-specific encoding class for the platform you are on. In other words, this is the cross-dialect, cross-platform way of doing it. Finally, when it comes to encoding, you have to do it right at the border of a system, where data is exchanged. It only works if negotiation about the encoding is in place and taken care of. In every other case, where the encodings differ even slightly, it will fail one way or another.
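
As a minimal workspace sketch of that lookup and a round trip through it (assuming Grease is loaded in the image; the variable names are only illustrative, and 228 is the code point of ä):

```smalltalk
| codec native encoded decoded |
"Ask Grease for whatever utf-8 codec the current platform provides."
codec := GRCodec forEncoding: 'utf-8'.
"A one-character string built inside the image (ä, code point 228)."
native := String with: (Character value: 228).
"Encode at the border going out, decode at the border coming in."
encoded := codec encode: native.
decoded := codec decode: encoded.
Transcript
    show: 'native size: ', native size printString; cr;
    show: 'encoded size: ', encoded size printString; cr;
    show: 'round trip ok: ', (decoded = native) printString; cr.
```

The encoded form should be longer than the native one, since ä takes two bytes in utf-8, and decoding it again should give back the original string.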

And by the way, UTF-32 is not "true" Unicode. Unicode is the mapping of characters and symbols to numbers (code points). The UTF-xx formats are encodings that turn those code points into bytes. They are either space-efficient (UTF-8) or performance-efficient (UTF-16, UTF-32). Of these, only UTF-8 is independent of byte order.

Norbert

Philippe Marschall

Re: UTF8TextConverter, GRPharoUtf8Codec, GRPharoUtf8CodecStream against GRNullCodec

In reply to this post by marten

2011/10/13 Marten Feldtmann <[hidden email]>:

> Hello,
>
> I have a question about what all these classes do (in the big picture), how
> they work together, and when they are actually called. I looked into the
> source code, but I still have trouble fully understanding it.
>
> When I have an Adapter with GRNullCodec I assume, that all (?) traffic,
> content (?) goes through the GRNullCodec, but due to the fact, that
> GRNullCodec does nothing, the traffic/content is not changed.
>
> What exactly goes through these codecs?
>
> If I use an adapter with GRPharoUtf8Codec is then the content converted
> to/from UTF8 ????
>
> What does this mean to strings (in my application) I render on my pages like
> in the following command:
>
> html text: stringInSomeCodePage
>
> in both cases GRNullCodec and GRPharoUtf8Codec and with texts in specific
> code pages like Utf8, Latin1 and "true" Unicode (Utf32).
>
> In my first demos I held all my strings in UTF8 and used the GRNullCodec
> (and everything was ok on the browser side).
>
> Then I changed to GRPharoUtf8Codec and it seems to me, that I got now an
> additional UTF8 conversion. Then I switched my application strings back to
> Latin1 and it was ok again.
>
> How does this all work with Unicode characters with code points > 255 (and
> usage of GRPharoUtf8Codec) (in Squeak: WideString)?
>
> When is a GRPharoUtf8Codec really needed ??
>
> Perhaps this is a stupid question .... but then I would like to know it :-))

So, …

A codec is an object that handles string encodings. It has two main
responsibilities:
encodedString - #decode: -> decodedString
decodedString - #encode: -> encodedString

[1] for example

utf-8 - #decode: -> "native" string
"native" string - #encode: -> utf-8

"Native" means whatever the semantics of a string are in the
current dialect. That's ByteString/WideString in Pharo,
ByteString/TwoByteString/FourByteString in VW and so on. ä is 'ä', €
is '€' and ☃ is '☃'.

A codec stream is a stream that encodes the elements you write on it
and passes them to an underlying stream. So for example you can write
"native" strings to a codec stream and it encodes them and passes them
to an underlying stream. You can ask a codec for a stream.
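
A rough sketch of asking a codec for such a stream. It assumes the #encoderFor: selector of GRCodec and a plain String-based WriteStream as the underlying stream; other Grease versions may expect a different kind of stream, so treat this as an illustration rather than the definitive API.

```smalltalk
| codec encoder |
codec := GRCodec forEncoding: 'utf-8'.
"Wrap an ordinary write stream; what is written gets encoded on the way through."
encoder := codec encoderFor: (WriteStream on: String new).
"Write a native string (ä, code point 228) to the codec stream."
encoder nextPutAll: (String with: (Character value: 228)).
"The underlying stream now holds the encoded form."
Transcript show: encoder contents size printString; cr.
```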

GRPharoUtf8Codec and GRPharoUtf8CodecStream are the Pharo-specific
implementation classes for utf-8. They contain some fast paths that
only work in Pharo. They are part of Seaside. UTF8TextConverter is
part of Pharo and implements utf-8 decoding and encoding. We use it
for cases that are not covered by the fast paths.

GRNullCodec implements an identity transformation: #encode: and
#decode: always answer the exact same string you passed them. It's
there for historic reasons and because some non-single-byte string
classes have (had) quite severe bugs and performance regressions.

So how does all this fit together?

The codec on the server adapter decodes the request with #decode: and
encodes the response with #encode: [2].

So when you set the codec on the server adapter to utf-8, the following
is supposed to happen:
request (utf-8) - #decode: -> "native" string
response ("native" string) - #encode: -> utf-8

That means the strings you #render: or #text: have to be "native".
They must not be in any encoding other than "Smalltalk". It also means
the encoding the application reports to the browser has to be utf-8
(happens automatically unless you override it).
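
A hedged sketch of setting the codec on a server adapter. It assumes the Comanche adaptor (WAComancheAdaptor) is loaded and uses port 8080 purely as an example; other WAServerAdaptor subclasses should be configured the same way.

```smalltalk
| adaptor |
"Assumption: Seaside with the Comanche adaptor is loaded; 8080 is an arbitrary port."
adaptor := WAComancheAdaptor port: 8080.
"Requests are decoded from utf-8, responses are encoded to utf-8."
adaptor codec: (GRCodec forEncoding: 'utf-8').
adaptor start.
```

With this in place, strings passed to #render: or #text: should be plain native strings such as (String with: (Character value: 228)) and not pre-encoded byte sequences.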

So what if you set the codec to a NullCodec? Well, nothing happens.

request (whatever encoding) - #decode: -> whatever encoding
response (whatever encoding) - #encode: -> whatever encoding

You get strings where each character is a byte as sent by the browser,
and you're supposed to deliver strings where each character is a byte in
the same encoding.
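
The identity behaviour can be checked in a workspace. This is a small sketch assuming Grease is loaded; the two characters are the utf-8 bytes of ä, and the variable names are illustrative.

```smalltalk
| null bytes |
null := GRNullCodec new.
"Two characters holding the utf-8 bytes of ä (195, 164)."
bytes := String with: (Character value: 195) with: (Character value: 164).
"Both directions should answer a string equal to the one passed in."
Transcript
    show: ((null decode: bytes) = bytes) printString; cr;
    show: ((null encode: bytes) = bytes) printString; cr.
```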

So in the case of utf-8 you would get:

request (utf-8) - #decode: -> utf-8
response (utf-8) - #encode: -> utf-8

So instead of 'ä' you would get (String with: (Character value: 195)
with: (Character value: 164)) (an ä encoded as utf-8). The same is
true for the strings you #render: and #text:, they have to be utf-8
encoded already as well. It also means the encoding the application
reports to the browser has to be utf-8.
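
This also explains the "additional UTF8 conversion" from the original question: with the utf-8 codec on the adapter, a string that is already utf-8 encoded gets encoded a second time. A minimal sketch, assuming Grease is loaded (variable names are illustrative):

```smalltalk
| utf8 alreadyEncoded doubleEncoded |
utf8 := GRCodec forEncoding: 'utf-8'.
"ä that was stored in the application as utf-8 bytes (195, 164)."
alreadyEncoded := String with: (Character value: 195) with: (Character value: 164).
"The codec cannot know this is already encoded, so it encodes each character again."
doubleEncoded := utf8 encode: alreadyEncoded.
Transcript
    show: 'already encoded size: ', alreadyEncoded size printString; cr;
    show: 'double encoded size: ', doubleEncoded size printString; cr.
```

The second size should come out larger than the first, which is what the browser then displays as garbage. The choice is therefore either native strings with the utf-8 codec, or pre-encoded strings with GRNullCodec, but not a mix of the two.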

I hope this makes what's supposed to happen a bit more clear.

[1] Yes, it's actually wrong that we have encoded strings; we should
have byte arrays instead.
[2] Actually it's a bit more involved. The adapter also has a #url codec that
is responsible for the URL encoding when rendering URLs. URLs do not
necessarily have the same encoding as the HTML page on which they are
rendered (yes).

Cheers
Philippe