String >> utf8Encoded, ByteArray >> utf8Decoded

String >> utf8Encoded, ByteArray >> utf8Decoded

Tony Garnock-Jones-5
Pharo has these two convenience methods:

    String >> utf8Encoded
    ByteArray >> utf8Decoded

It looks like, roughly,

 - Pharo's  "aString utf8Encoded" is equivalent to
   Squeak's "aString squeakToUtf8 asByteArray", and

 - Pharo's  "aByteArray utf8Decoded" is equivalent to
   Squeak's "aByteArray asString utf8ToSqueak".

Should we add definitions like the following to Squeak?

    String >> utf8Encoded
      ^ self squeakToUtf8 asByteArray

    ByteArray >> utf8Decoded
      ^ self asString utf8ToSqueak

?

It takes me an age to find the proper Squeakly incantation every time I
work with network protocols in Squeak. I think these convenience methods
would help.
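
As a rough illustration of what the two proposed methods would answer, here is a workspace sketch of the round trip using only the existing selectors named above. The sample string and the variable names are assumptions made up for the example.

```smalltalk
"Workspace sketch: round trip through UTF-8 with the existing Squeak selectors.
 The sample text and variable names are illustrative assumptions."
| text bytes decoded |
text := 'Grüße'.
bytes := text squeakToUtf8 asByteArray.    "what utf8Encoded would answer"
decoded := bytes asString utf8ToSqueak.    "what utf8Decoded would answer"
bytes size.        "7: ü and ß each take two bytes in UTF-8"
decoded = text.    "should be true if the round trip is lossless"
```

With the proposed methods in place, the two middle lines would shrink to "text utf8Encoded" and "bytes utf8Decoded".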

Tony

Re: String >> utf8Encoded, ByteArray >> utf8Decoded

timrowledge
What would be so much better is a proper UTF8String class.

One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.

We do, however, have a sorta-kinda model of a way to handle it in FilePath, which actually has nothing much to do with file paths at all. FilePath is a bit more general than a UTF-* string and maybe that is still a valuable option.

I see two basic options for making improvements

a) a simplistic UTF8String that is a byte array of the utf8 bytes, does nothing much except exist as an encoded string to pass to primitives. #size returns the number of bytes, #at: & #at:put: are not for general consumption, to do any sort of editing you have to convert it to a real String.

b) something like FilePath, with both the original and encoded version kept, automagic conversions and some interesting hand-waving to deal with #size (is it the number of characters, or the number of bytes?) etc.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- His future is behind schedule.



Re: String >> utf8Encoded, ByteArray >> utf8Decoded

Tobias Pape
Hi all.

First, I think Tony's idea is good in terms of usability.

Second:

> On 28.01.2018, at 19:06, tim Rowledge <[hidden email]> wrote:
>
> What would be so much better is a proper UTF8String class.
>
> One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.

I think our ByteString/WideString is already pretty good (we have complete Unicode coverage and whatnot). If we have to improve, let's first have a look at the conceptual things.
Please please have a look at:

        https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

When we have an improved String, it should

 - report size in terms of extended grapheme clusters.
 - Never Ever Expose UTF8 bytes to users
 - Always have UTF8 encodings result in ByteArrays.
   (because, well, a utf8-encoded thing is no longer a string, it's just encoded byte data)
 - do normalization correctly…
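
To make the first point concrete, here is a small workspace sketch. The example string is an assumption chosen for illustration, not taken from the linked article: an "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one user-perceived character, yet it is two code points and three UTF-8 bytes.

```smalltalk
"Workspace sketch: one grapheme cluster, two code points, three UTF-8 bytes.
 The string is an illustrative assumption."
| s |
s := WideString new: 2.
s at: 1 put: $e.
s at: 2 put: (Character value: 16r301).    "U+0301 COMBINING ACUTE ACCENT"
s size.                                    "2: counts code points, not grapheme clusters"
s squeakToUtf8 asByteArray size.           "3: one byte for e, two for U+0301"
```

Neither number matches the single character a reader sees, which is the gap a grapheme-aware #size would close.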

that's my 2ct
Best regards
        -Tobias

>
> We do, however, have a sorta-kinda model of a way to handle it in FilePath, which actually has nothing much to do with file paths at all. FilePath is a bit more general than a UTF-* string and maybe that is still a valuable option.
>
> I see two basic options for making improvements
>
> a) a simplistic UTF8String that is a byte array of the utf8 bytes, does nothing much except exist as an encoded string to pass to primitives. #size returns the number of bytes, #at: & #at:put: are not for general consumption, to do any sort of editing you have to convert it to a real String.
>
> b) something like FilePath, with both the original and encoded version kept, automagic conversions and some interesting hand-waving to deal with #size (is it the number of characters, or the number of bytes?) etc.
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Useful random insult:- His future is behind schedule.
>
>
>


Re: String >> utf8Encoded, ByteArray >> utf8Decoded

Eliot Miranda-2


On Sun, Jan 28, 2018 at 10:40 AM, Tobias Pape <[hidden email]> wrote:
> Hi all.
>
> First, I think Tony's idea is good in terms of usability.

+1

> Second:
>
>> On 28.01.2018, at 19:06, tim Rowledge <[hidden email]> wrote:
>>
>> What would be so much better is a proper UTF8String class.
>>
>> One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.
>
> I think our ByteString/WideString is already pretty good (we have complete Unicode coverage and whatnot). If we have to improve, let's first have a look at the conceptual things.
> Please please have a look at:
>
>         https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>
> When we have an improved String, it should
>
>  - report size in terms of extended grapheme clusters.
>  - Never Ever Expose UTF8 bytes to users
>  - Always have UTF8 encodings result in ByteArrays.
>    (because, well, a utf8-encoded thing is no longer a string, it's just encoded byte data)
>  - do normalization correctly…
>
> that's my 2ct
> Best regards
>         -Tobias
>
>> We do, however, have a sorta-kinda model of a way to handle it in FilePath, which actually has nothing much to do with file paths at all. FilePath is a bit more general than a UTF-* string and maybe that is still a valuable option.
>>
>> I see two basic options for making improvements
>>
>> a) a simplistic UTF8String that is a byte array of the utf8 bytes, does nothing much except exist as an encoded string to pass to primitives. #size returns the number of bytes, #at: & #at:put: are not for general consumption, to do any sort of editing you have to convert it to a real String.
>>
>> b) something like FilePath, with both the original and encoded version kept, automagic conversions and some interesting hand-waving to deal with #size (is it the number of characters, or the number of bytes?) etc.
>>
>> tim
>> --
>> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
>> Useful random insult:- His future is behind schedule.





--
_,,,^..^,,,_
best, Eliot


Re: String >> utf8Encoded, ByteArray >> utf8Decoded

timrowledge
In reply to this post by Tobias Pape


> On 28-01-2018, at 10:40 AM, Tobias Pape <[hidden email]> wrote:
>
>> On 28.01.2018, at 19:06, tim Rowledge <[hidden email]> wrote:
>>
>> What would be so much better is a proper UTF8String class.
>>
>> One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.
>
> I think our ByteString/WideString is already pretty good

I agree; the mess I was pointing at is the lack of support for knowing that you have a UTF-* encoded string in the ByteArray you just got passed. I’ve been bitten by that quite a few times in the NuScratch and MQTT packages for example.

Having a class that tells us the content is a string rendered in utf-8 encoded bytes would be a useful thing, not least because it would make a nice simple way to know that #asString requires converting it that way. Leaving everything as just a ByteArray means we know too little about it to be helpful. Maybe rather than calling such a class ‘UTF8String’, which implies the whole String-ness thing, we should have a UTF8EncodedBytes class to be really clear. One issue with doing anything like this is having to make the VM return the new class instead of a ByteArray - or perhaps make the prim related code use #adoptInstance: a bit like some of the FileStream and SocketStream methods do.
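
A minimal sketch of how that could look, assuming the class is a plain byte-indexed subclass of ByteArray. UTF8EncodedBytes, its category name and the #on: constructor are all hypothetical; only #adoptInstance:, #asString and #utf8ToSqueak are existing selectors.

```smalltalk
"Sketch only: UTF8EncodedBytes, its category and #on: are hypothetical names."
ByteArray variableByteSubclass: #UTF8EncodedBytes
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Sketch-UTF8'

UTF8EncodedBytes class >> on: aByteArray
    "Answer a copy of aByteArray retagged as UTF-8 encoded bytes.
     adoptInstance: changes the class of the copy in place; this relies on
     both classes having the same byte-indexed format."
    | copy |
    copy := aByteArray copy.
    self adoptInstance: copy.
    ^ copy

UTF8EncodedBytes >> asString
    "Decode the UTF-8 bytes instead of reinterpreting each byte as a character."
    ^ super asString utf8ToSqueak
```

Primitive-related code that currently answers a bare ByteArray could then answer "UTF8EncodedBytes on: bytes", and callers could send #asString without having to know the encoding.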

I like your idea of a really well done string system that does all that; I don’t like the amount of work it feels like would be needed. I certainly can’t imagine having time to do it myself. Pretty sure I have less than 200 years to go...

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Oxymorons: "Now, then ..."



Re: String >> utf8Encoded, ByteArray >> utf8Decoded

Eliot Miranda-2
Hi Tim,

> On Jan 28, 2018, at 11:13 AM, tim Rowledge <[hidden email]> wrote:
>
>
>
>>> On 28-01-2018, at 10:40 AM, Tobias Pape <[hidden email]> wrote:
>>>
>>> On 28.01.2018, at 19:06, tim Rowledge <[hidden email]> wrote:
>>>
>>> What would be so much better is a proper UTF8String class.
>>>
>>> One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.
>>
>> I think our ByteString/WideString is already pretty good
>
> I agree; the mess I was pointing at is the lack of support for knowing that you have a UTF-* encoded string in the ByteArray you just got passed. I’ve been bitten by that quite a few times in the NuScratch and MQTT packages for example.
>
> Having a class that tells us the content is a string rendered in utf-8 encoded bytes would be a useful thing, not least because it would make a nice simple way to know that #asString requires converting it that way. Leaving everything as just a ByteArray means we know too little about it to be helpful. Maybe rather than calling such a class ‘UTF8String’, which implies the whole String-ness thing, we should have a UTF8EncodedBytes class to be really clear. One issue with doing anything like this is having to make the VM return the new class instead of a ByteArray - or perhaps make the prim related code use #adoptInstance: a bit like some of the FileStream and SocketStream methods do.

UTF8EncodedBytes is a really good idea.

>
> I like your idea of a really well done string system that does all that; I don’t like the amount of work it feels like would be needed. I certainly can’t imagine having time to do it myself. Pretty sure I have less than 200 years to go...
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Oxymorons: "Now, then ..."
>
>
>

Re: String >> utf8Encoded, ByteArray >> utf8Decoded

David T. Lewis
In reply to this post by Eliot Miranda-2
On Sun, Jan 28, 2018 at 10:52:04AM -0800, Eliot Miranda wrote:
> On Sun, Jan 28, 2018 at 10:40 AM, Tobias Pape <[hidden email]> wrote:
>
> > Hi all.
> >
> > First, I think Tony's idea is good in terms of usability.
> >
>
> +1
>

+1

Sounds like a good idea to me too. Tony, if you are working with an
up-to-date image, please save those two methods to the inbox and I (or
someone else) will move them to trunk.

Thanks!
Dave

Re: String >> utf8Encoded, ByteArray >> utf8Decoded

timrowledge
In reply to this post by Eliot Miranda-2


> On 28-01-2018, at 12:56 PM, Eliot Miranda <[hidden email]> wrote:
>
> UTF8EncodedBytes is a really good idea.

Yup - but there’s no chance of me getting to do anything about it right now. There’s so much code that anticipates getting back byte arrays (or sometimes, byte strings. Or maybe something else) and then does odd looking stuff with it … I’m trying to clean out the project related dispatch stuff and finding so very much nasty stuff underneath it that I’m mentally swamped.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- One flower short of an arrangement.