primitive 468

NorbertHartl
I'm wondering at the moment whether primitive 468 will also work for bytes instead of characters. I've taken this method from String:

String>>_unicodePrim: opCode

"opCode 0 - Encode receiver in UTF8 format. If all characters in receiver
            are US-ASCII, answer the receiver. Result is a String.
 opCode 1 - Decode receiver from UTF8 format into either a String or
            QuadByteString depending upon the range of characters involved.
 opCode 2 - Decode receiver from UTF8 format into either a String or
            Double/QuadByteString depending upon the range of characters
            involved."

<primitive: 468>
self _primitiveFailed: #_unicodePrim:
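
The difference between opCode 1 and opCode 2 in that comment can be pictured with a rough Python sketch (this is not GemStone code). The helper name `narrowest_class` and the thresholds 255 and 65535 are assumptions made for illustration; the primitive's actual rules are not shown in this thread.

```python
def narrowest_class(data: bytes, allow_double_byte: bool) -> str:
    """Decode UTF-8 bytes and name the narrowest string class that fits.

    Assumption: String holds code points up to 255, DoubleByteString up
    to 65535, and QuadByteString anything larger.
    """
    text = data.decode("utf-8")
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return "String"
    if allow_double_byte and highest <= 0xFFFF:
        return "DoubleByteString"
    return "QuadByteString"


euro = "€".encode("utf-8")  # U+20AC does not fit into 8 bits
print(narrowest_class(euro, allow_double_byte=False))  # opCode 1 style
print(narrowest_class(euro, allow_double_byte=True))   # opCode 2 style
```

Under these assumptions the only difference between the two decode opcodes is whether the intermediate two-byte class may be answered.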

Since it goes native, there isn't a real distinction between an 8-bit character and a byte, right? But I'd like an estimate before I try to bork my image :)

The rationale behind it is, again, UTF-8 handling. In GemStone everything is a string, regardless of whether it is encoded or not. To be honest, I don't think this is a good idea. To get a clear understanding of the issue, it should be possible to separate things. An easy-to-follow rule might be:

string -> encoder -> byte array
byte array -> decoder -> string

That behaviour is somewhat reflected in the GRPharoUtf8Codec but different in GemStone. In my own code I like to handle it the way I sketched it. On the one hand, I see a large potential for different opinions about the string issue. On the other hand, I think that regardless of how strings work it should be possible to decode a byte array to a string. If you tell me that it might work to copy decodeFromUtf8 and _unicodePrim: from String to ByteArray, then I will give it a try. I already saw that I need to be the SystemUser in order to do it. Seems like a new exercise :)
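
As a minimal sketch of the rule above (in Python, not GemStone Smalltalk), the two directions can be kept apart by type. The function names `encode_utf8` and `decode_utf8` are hypothetical and only stand in for the encoder and decoder boxes.

```python
def encode_utf8(text: str) -> bytes:
    """string -> encoder -> byte array"""
    return text.encode("utf-8")


def decode_utf8(data: bytes) -> str:
    """byte array -> decoder -> string"""
    if not isinstance(data, (bytes, bytearray)):
        # Refuse text on purpose so encoded and decoded data cannot mix.
        raise TypeError("decode_utf8 expects bytes, not text")
    return bytes(data).decode("utf-8")


encoded = encode_utf8("Grüße")
print(type(encoded).__name__, len(encoded))
print(decode_utf8(encoded) == "Grüße")
```

With this split, handing an already decoded string to the decoder fails immediately instead of silently producing garbage.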

Norbert

Re: primitive 468

NorbertHartl
Hmmm,

I read it again and some things are misleading. The behaviour in Pharo is no different from the one in GemStone. I just wondered why decode: expects a string. It seems that for a byte array one needs to either convert it into a string or use the codec stream instead.

Norbert
On 02.12.2010, at 17:11, Norbert Hartl wrote:

> I'm asking myself at the moment if primitive 468 will also work for bytes instead of characters. I've taken the method from String

Re: primitive 468

Dale Henrichs
In reply to this post by NorbertHartl
Norbert,

I think the code is written to work pretty much on Strings, but a
variant of the primitive could be written to produce ByteArrays and,
vice versa, a primitive could be written to take ByteArrays and produce
Strings after decoding ...

I would imagine that, at the very bottom, the reason a UTF8 encoded
String is returned as a String is that for a US-ASCII String the two are
identical collections of bytes.
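
A small Python check (not GemStone code) shows the point about US-ASCII: each UTF-8 byte equals the code point of the corresponding character, which stops being true once a non-ASCII character appears. The helper name `same_bytes_when_encoded` is made up for this illustration.

```python
def same_bytes_when_encoded(text: str) -> bool:
    """True if the UTF-8 bytes equal the character code points one for one."""
    return list(text.encode("utf-8")) == [ord(ch) for ch in text]


print(same_bytes_when_encoded("primitive 468"))  # pure US-ASCII
print(same_bytes_when_encoded("café"))           # contains U+00E9
```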

The original primitive was written when all of the encoding was done
from String to String in the original Smalltalk-based algorithm and in
Seaside too (I think). Recently the Seaside folks switched to using
ByteArrays, and I didn't switch because the primitives were written to
operate on and produce Strings.

In Pharo I have noticed that ByteArrays are accepted in a bunch of the
String primitives when they probably shouldn't be (the primitive failure
code is expecting Strings, but the primitive calls allow ByteArrays)
which leads to code in Pharo where it doesn't matter whether you have a
String or a ByteArray ... GemStone's primitives are more stringent
(which is how I've discovered the "loose" Pharo primitives)....

In the end I agree that ByteArrays should be used for the encoded
Strings....

It looks like primitive 468 requires a String for the decode operation :(

Dale


On 12/02/2010 08:11 AM, Norbert Hartl wrote:

> I'm asking myself at the moment if primitive 468 will also work for
> bytes instead of characters. I've taken the method from String

Re: primitive 468

Philippe Marschall
In reply to this post by NorbertHartl
2010/12/2 Norbert Hartl <[hidden email]>:

> I'm asking myself at the moment if primitive 468 will also work for bytes instead of characters. I've taken the method from String
>
> The rationale behind it is, again, UTF-8 handling. In GemStone everything is a string, regardless of whether it is encoded or not. To be honest, I don't think this is a good idea. To get a clear understanding of the issue, it should be possible to separate things. An easy-to-follow rule might be:
>
> string -> encoder -> byte array
> byte array -> decoder -> string

I agree that this is the long-term solution we want to get to, but
for mostly historical reasons that's not how it currently works in
Seaside and Pharo. The current situation in Seaside is:

string -> encoder -> string
string/byte array -> decoder -> string

It's on our todo list to change this, but not in the short term.
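
For illustration only, the current string-to-string arrangement can be mimicked in Python (this is not Seaside or Pharo code). It assumes the encoded string holds one character per UTF-8 byte, which is modelled here with Latin-1; the names `encode_to_string` and `decode_any` are hypothetical.

```python
from typing import Union


def encode_to_string(text: str) -> str:
    """string -> encoder -> string, one character per UTF-8 byte."""
    return text.encode("utf-8").decode("latin-1")


def decode_any(data: Union[str, bytes]) -> str:
    """string/byte array -> decoder -> string."""
    if isinstance(data, str):
        # Turn the byte-valued characters back into real bytes first;
        # this raises if any character is above 255.
        data = data.encode("latin-1")
    return data.decode("utf-8")


wire = encode_to_string("é")
print(len(wire))                 # two characters, one per encoded byte
print(decode_any(wire) == "é")
print(decode_any(b"\xc3\xa9") == "é")
```

Because the encoded result is itself a string, nothing in its type says whether it has already been encoded, which is the ambiguity the byte array approach avoids.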

Cheers
Philippe