I'm wondering whether primitive 468 would also work on bytes instead of characters. I took the method from String:

    String>>_unicodePrim: opCode

    "opCode 0 - Encode receiver in UTF8 format. If all characters in receiver
       are US-ASCII, answer the receiver. Result is a String.
     opCode 1 - Decode receiver from UTF8 format into either a String or
       QuadByteString depending upon the range of characters involved.
     opCode 2 - Decode receiver from UTF8 format into either a String or
       Double/QuadByteString depending upon the range of characters
       involved."

    <primitive: 468>
    self _primitiveFailed: #_unicodePrim:

Since the primitive goes native, there is no real distinction between an 8-bit character and a byte, right? But I'd like an estimate before I try to bork my image :)

The rationale behind it is, again, UTF-8 handling. In GemStone everything is a string, regardless of whether it is encoded or not. To be honest, I don't think this is a good idea. To get a clear understanding of the issue it should be possible to separate things. An easy-to-follow rule might be:

    string -> encoder -> byte array
    byte array -> decoder -> string

That behaviour is somewhat reflected in the GRPharoUtf8Codec but different in GemStone. In my own code I like to handle it the way I sketched it. On the one hand I see a large potential for different opinions about the string issue. On the other hand I think that, regardless of how strings work, it should be possible to decode a byte array to a string. If you tell me that it might work to copy decodeFromUtf8 and _unicodePrim: from String to ByteArray, then I would give it a try. I already saw that I need to be the SystemUser in order to do it. Seems like a new exercise :)

Norbert
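The rule sketched above (string -> encoder -> byte array, byte array -> decoder -> string) can be illustrated with a small model. This is only a sketch in Python, used as neutral notation; `encode_utf8` and `decode_utf8` are hypothetical names and are not GemStone, Pharo, or Grease selectors. It says nothing about whether primitive 468 itself accepts a ByteArray.

```python
def encode_utf8(text: str) -> bytes:
    """string -> encoder -> byte array"""
    if not isinstance(text, str):
        raise TypeError("encoder expects a decoded string")
    return text.encode("utf-8")


def decode_utf8(data: bytes) -> str:
    """byte array -> decoder -> string"""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("decoder expects a byte array")
    return bytes(data).decode("utf-8")


# Round trip: the encoded form is always bytes, the decoded form always a string.
encoded = encode_utf8("Grüße")
roundtrip = decode_utf8(encoded)
```

With the two types kept apart like this, passing an already-encoded value to the encoder, or a decoded string to the decoder, fails immediately instead of silently double-encoding.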
Hmmm,
I read it again and some things are misleading. The behaviour in Pharo is no different from the one in GemStone. I just wondered why decode: expects a string. It seems that for a byte array one needs to convert it into a string or use the codec stream instead.

Norbert

On 02.12.2010, at 17:11, Norbert Hartl wrote:

> I'm asking myself at the moment if primitive 468 will also work for bytes instead of characters. [...]
In reply to this post by NorbertHartl
Norbert,
I think the code is written to work pretty much on Strings, but a variant of the primitive could be written to produce ByteArrays and, vice versa, a primitive could be written to take ByteArrays and produce Strings after decoding ...

I would imagine that at the very bottom the reason that a UTF8 encoded String is returned as a String is that, for a US-ASCII String, they are identical collections of bytes.

The original primitive was written when all of the encoding was done from String to String in the original Smalltalk-based algorithm and in Seaside too (I think). Recently the Seaside folks switched to using ByteArrays and I didn't switch because the primitives were written to operate on and produce Strings.

In Pharo I have noticed that ByteArrays are accepted in a bunch of the String primitives when they probably shouldn't be (the primitive failure code is expecting Strings, but the primitive calls allow ByteArrays), which leads to code in Pharo where it doesn't matter whether you have a String or a ByteArray ... GemStone's primitives are more stringent (which is how I've discovered the "loose" Pharo primitives) ...

In the end I agree that ByteArrays should be used for the encoded Strings ...

It looks like primitive 468 requires a String for the decode operation :(

Dale

On 12/02/2010 08:11 AM, Norbert Hartl wrote:

> I'm asking myself at the moment if primitive 468 will also work for
> bytes instead of characters. [...]
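Dale's remark that a UTF-8 encoded US-ASCII String is the identical collection of bytes (and the opCode 0 comment, "If all characters in receiver are US-ASCII, answer the receiver") can be checked with a short sketch. Python is used here only as neutral notation; `utf8_is_identity` is a hypothetical helper, not part of any Smalltalk image.

```python
def utf8_is_identity(text: str) -> bool:
    """True when UTF-8 encoding maps every character to the same single byte."""
    return text.isascii()


ascii_text = "primitive 468"
mixed_text = "Zürich"

ascii_bytes = ascii_text.encode("utf-8")
mixed_bytes = mixed_text.encode("utf-8")

# For pure US-ASCII, the byte values equal the character code points one for one.
ascii_same = list(ascii_bytes) == [ord(c) for c in ascii_text]

# As soon as one character falls outside US-ASCII, the encoded form is longer
# than the string and no longer matches it character for byte.
mixed_same = list(mixed_bytes) == [ord(c) for c in mixed_text]
```

This is why a String-to-String encoder can get away with answering the receiver for ASCII input, and also why the String/ByteArray distinction only becomes visible with non-ASCII data.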
In reply to this post by NorbertHartl
2010/12/2 Norbert Hartl <[hidden email]>:
> To get a clear understanding of the issue it should be possible to separate things. An easy to follow rule might be that there is
>
> string -> encoder -> byte array
> byte array -> decoder -> string

I agree that this is the long-term solution we want to reach, but mostly for historical reasons that's not how it currently works in Seaside and Pharo. The current situation in Seaside is:

    string -> encoder -> string
    string/byte array -> decoder -> string

It's on our todo list to change this, but not in the short term.

Cheers
Philippe
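The "current situation" Philippe describes (string -> encoder -> string, string/byte array -> decoder -> string) can be modelled as follows. This is a sketch under the assumption that an "encoded string" holds one UTF-8 byte per character; it is written in Python as neutral notation, the function names are hypothetical, and it is not the actual Seaside or Grease codec code.

```python
from typing import Union


def encode_to_string(text: str) -> str:
    """string -> encoder -> string: each UTF-8 byte becomes one character (0..255)."""
    return text.encode("utf-8").decode("latin-1")


def decode_any(data: Union[str, bytes]) -> str:
    """string/byte array -> decoder -> string."""
    if isinstance(data, str):
        # Treat the string as a container of byte values; this raises
        # UnicodeEncodeError if any character is above 255.
        data = data.encode("latin-1")
    return data.decode("utf-8")


encoded_string = encode_to_string("é")             # two characters, one per UTF-8 byte
from_string = decode_any(encoded_string)           # decoding an "encoded string"
from_bytes = decode_any("é".encode("utf-8"))       # decoding a real byte array
```

In this model nothing in the type tells an encoded string apart from a decoded one, which is the ambiguity the string -> byte array -> string rule is meant to remove.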