Recent change in byte array at:put:

timrowledge
We recently changed ByteArray>>at:put: to remove the backup conversion of the value to integer, for what seemed like decent reasons.

It’s broken my WeatherStation code a little because there are places where I use
{my byte stream} nextPutAll: (aString squeakToUtf8)
or similar. #squeakToUtf8 returns a ByteString, and of course when the #nextPutAll: loop does its thing each element is pulled out as a Character (even though we know at this point it’s a byte value) - and we’ve just made it impossible to stick a Character into a ByteArray.

Clearly I could fix it reasonably trivially with a few #asByteArray type messages scattered around but it feels a bit tacky somehow. I see some faintly similar code with plausibly similar issues in WebSocket classes too, which would need some care. Not that I can see a lot of usage of that code…

Performance isn’t a colossal issue for MQTT packets but it just rankles a bit to have a known byte-valued string and then have to convert it to write it into a byte-valued stream collection. KnowWhadIMean?
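
To make the failure concrete, here is a minimal sketch written as a Python analogue rather than Squeak code. It assumes that Python's `bytearray` can stand in for the stricter ByteArray, and that a Latin-1-decoded string can stand in for the byte-valued ByteString that #squeakToUtf8 returns. The names `byte_valued_string` and `put_all` are hypothetical and appear nowhere in Squeak.

```python
def byte_valued_string(text: str) -> str:
    """Mimic #squeakToUtf8: a string whose character codes are the UTF-8 bytes."""
    return text.encode("utf-8").decode("latin-1")


def put_all(buffer: bytearray, items) -> None:
    """Mimic the #nextPutAll: loop: store the elements one at a time."""
    for item in items:
        # bytearray.append accepts only integers in 0..255, so a
        # one-character string is rejected with a TypeError.
        buffer.append(item)


utf8_string = byte_valued_string("21°C")

# Writing the byte-valued string directly: elements arrive as characters.
strict = bytearray()
try:
    put_all(strict, utf8_string)
    rejected = False
except TypeError:
    rejected = True

# The #asByteArray-style fix: convert to real bytes before writing.
converted = bytearray()
put_all(converted, utf8_string.encode("latin-1"))
```

The conversion step is the part that rankles: the data is already byte-valued, but the container hands out characters, so it has to be re-typed before a strict byte collection will take it.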

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Never write software that anthropomorphizes the machine. They hate that.




Re: Recent change in byte array at:put:

Tobias Pape

Underlying questions:
- does a UTF-8-encoded string contain Unicode-valued characters?
-> is a UTF-8-encoded string a string after all?

I'd suggest no out of purity, but I expect yes from others out of practicality.

Best regards
        -Tobias




Re: Recent change in byte array at:put:

Nicolas Cellier



Absolutely,
to me a String is a sequence of characters.
squeakToUtf8 is a hack that makes us consider a String as a sequence of codePoints whose encoding is in the eye of the beholder (or implicitly in the Context - the Smalltalk one).
It's not very object-oriented and quite fragile.
We started to clean Multilingual but never finished the job...

It's difficult to finish it, because we value backward compatibility.
So maybe the ByteArray change was a bit radical in this respect.

Nicolas







Re: Recent change in byte array at:put:

timrowledge

> On 29-07-2017, at 12:48 PM, Nicolas Cellier <[hidden email]> wrote:
> Absolutely,
> to me a String is a sequence of characters.
> squeakToUtf8 is a hack that makes us consider a String as a sequence of codePoints whose encoding is in the eye of the beholder (or implicitly in the Context - the Smalltalk one).
> It's not very object-oriented and quite fragile.
> We started to clean Multilingual but never finished the job…

Yes, that’s pretty much how I see it. Currently the utf8 ‘string’ is just kept as a byte string and the user is expected to understand that it is in a rather dangerous state.

>
> It's difficult to finish it, because we value backward compatibility.
> So maybe the ByteArray change was a bit radical with this respect.

Backward compatibility can sometimes drive you to loud swearing!

Maybe a new message to return the ByteArray of the UTF-8 data could be added, leaving the old one alone. We should probably consider making an actual UTF8String class, though I did try to work out the best thing to do for that several years ago for NuScratch and got lost in the tangles. Editing the damn things is a pain, to say the least, so you get to thinking about having the canonical String as an instvar alongside a ByteArray, with edits working on the String, which gets converted at the end of the edit to update the ByteArray. Or the other way around… or… aaargh!


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Immune from any serious head injury.




Re: Recent change in byte array at:put:

Jakob Reschke-2
In reply to this post by Nicolas Cellier
Simple "solution" to editing: treat (encoded) Strings as immutable.

For editing/stringbuilding, use a WideString or a special kind of stream (MultiByteBinaryOrTextStream, or whatever it is called) plus additional support for inserting in the middle if desired. I remember somebody proposing Ropes when discussing a reformation of strings previously. Could be interesting at "edit-time".

For (in-memory) storage, encoded Strings should maybe just be ByteArrays paired with some TextConverter-like thing or at least a spec of the encoding so you can fetch Characters or configure streams from it on demand.
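
The pairing described above can be sketched roughly as follows. This is a Python analogue under stated assumptions, not a proposal for actual Squeak classes: `EncodedString`, `from_text`, `characters` and `edited` are hypothetical names, and the encoding "spec" is reduced to a codec name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Immutable bytes paired with the name of their encoding."""
    data: bytes
    encoding: str

    @classmethod
    def from_text(cls, text: str, encoding: str = "utf-8") -> "EncodedString":
        return cls(text.encode(encoding), encoding)

    def characters(self) -> str:
        """Decode on demand; the stored bytes are never modified."""
        return self.data.decode(self.encoding)

    def edited(self, edit) -> "EncodedString":
        """Edit a decoded copy and return a new encoded value."""
        return EncodedString.from_text(edit(self.characters()), self.encoding)


stored = EncodedString.from_text("Grüße")
louder = stored.edited(lambda s: s + "!")
```

Because the stored value is frozen, every edit goes through a decoded copy and produces a fresh encoded value, which is the "treat encoded Strings as immutable" idea in miniature.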

If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.









Re: Recent change in byte array at:put:

Tobias Pape

> On 30.07.2017, at 11:02, Jakob Reschke <[hidden email]> wrote:
>
> If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
>

While I agree in principle, don't come near me with utf16 ;)




Re: Recent change in byte array at:put:

Jakob Reschke-2
In reply to this post by Jakob Reschke-2
I did not want to advertise for UTF-16, but I thought maybe something
could be learned from the Java implementation of string building. ;-)
But now that I have read up on UTF-16 and its surrogates again, and
found this [1], I doubt it.

[1] https://stackoverflow.com/questions/26170180/complexity-of-insert0-c-operation-on-stringbuffer-is-it-o1




Re: Recent change in byte array at:put:

timrowledge
In reply to this post by Tobias Pape

> On 30-07-2017, at 6:34 AM, Tobias Pape <[hidden email]> wrote:
>
>
>> On 30.07.2017, at 11:02, Jakob Reschke <[hidden email]> wrote:
>>
>> If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
>>
>
> While I agree in principle, don't come near me with utf16 ;)

I was about to say something similar :-)

I think it’s reasonably clear that nobody wants to have UTF-X as the main representation of text within their system if any sort of editing might be involved. It’s just too painful. However, there seem to be quite a lot of places where UTF-8 has been chosen as a sort of interface coding, I imagine for space-saving reasons in general. It does seem like a bit of an early-’90s “oh my gosh, all the furrin letters take up so much space what can we do, we can’t ask people to install an entire megabyte of memory on their PCs!” thing.

For the NuScratch stuff I used Cairo/Pango to render text nicely and thus had to convert everything to UTF-8 in order to pass it to the renderer. No editing was done to any of that, so no backward conversions or complex parsing required. To my surprise the general performance on the Pis was not noticeably impacted; when I did my first experiments I thought I would have to render the full fonts out to make my own glyph bitmaps and so on, but in fact it worked nicely. Which meant that the languages with complex layout and kerning rules could be dealt with by somebody else’s code, which I like.

Jakob mentioned pairing encoded bytes with converters of some kind and that made me think of Text, where we pretty much do that already. I wonder if using a RunArray paired with the ByteArray of UTF-8 (or even, dog help us, UTF-16) to call out where non-byte characters lurk would work? Think about behaving as if the text attribute were ‘this one needs 3 bytes’ rather than ‘this one is in flashing red sparkles with rotating underlines and winking quotes’. Given that we are able to handle editing Text pretty well, maybe, just maybe, that would make editing UTF-X work decently? Sounds like a good student project to me ;-)
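
A rough sketch of the run-array idea, as a Python analogue rather than anything built on Squeak's RunArray: each run records a byte width and how many consecutive characters have that width, so a character index can be mapped to a byte offset without decoding everything. All function names here are hypothetical, and editing support (the hard part) is deliberately left out.

```python
def utf8_width(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def width_runs(data: bytes) -> list:
    """Run-length encode character widths as [width, count] pairs."""
    runs = []
    i = 0
    while i < len(data):
        width = utf8_width(data[i])
        if runs and runs[-1][0] == width:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([width, 1])   # start a new run
        i += width
    return runs


def char_at(data: bytes, runs: list, index: int) -> str:
    """Fetch the character at a zero-based character index by walking runs."""
    offset = 0
    for width, count in runs:
        if index < count:
            start = offset + index * width
            return data[start:start + width].decode("utf-8")
        index -= count
        offset += width * count
    raise IndexError("character index out of range")


data = "21°C in Zürich €".encode("utf-8")
runs = width_runs(data)
```

For mostly-ASCII text the run list stays short, so lookups skip over long stretches of one-byte characters in a single step; text that alternates widths constantly would gain little.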

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Calls people to ask them their phone number.




Re: Recent change in byte array at:put:

Bert Freudenberg
In reply to this post by timrowledge
It only worked accidentally - you can't put text in a binary stream. So either use a string stream or put a byte array.

It may make sense to create a #asUtf8Bytes method ... or maybe a #nextPutAllUtf8: which could avoid the extra copy.
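
As a sketch of what those two messages might do, here is a Python analogue. Neither function corresponds to an existing Squeak method; `as_utf8_bytes` and `next_put_all_utf8` simply mirror the proposed names, and `io.BytesIO` stands in for a binary write stream.

```python
import io


def as_utf8_bytes(text: str) -> bytes:
    """Analogue of the proposed #asUtf8Bytes: return the encoded bytes."""
    return text.encode("utf-8")


def next_put_all_utf8(stream, text: str) -> None:
    """Analogue of the proposed #nextPutAllUtf8:: encode each character
    straight into the binary stream, without first building a complete
    encoded copy of the text."""
    for ch in text:
        cp = ord(ch)
        # Surrogate code points are not rejected here, unlike str.encode.
        if cp < 0x80:
            encoded = (cp,)
        elif cp < 0x800:
            encoded = (0xC0 | (cp >> 6), 0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            encoded = (0xE0 | (cp >> 12),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
        else:
            encoded = (0xF0 | (cp >> 18),
                       0x80 | ((cp >> 12) & 0x3F),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
        stream.write(bytes(encoded))


packet = io.BytesIO()
next_put_all_utf8(packet, "21°C €")
```

The first form hands the caller a genuine byte collection to write; the second keeps the text as text until the moment it reaches the stream, which is where the saving of an intermediate copy would come from.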

- Bert -