The Trunk: TrueType-nice.13.mcz


commits-2
Nicolas Cellier uploaded a new version of TrueType to project The Trunk:
http://source.squeak.org/trunk/TrueType-nice.13.mcz

==================== Summary ====================

Name: TrueType-nice.13
Author: nice
Time: 18 January 2010, 6:18:47.592 pm
UUID: 9c0ae3bf-8eee-434b-bb69-a4cf0bda202a
Ancestors: TrueType-nice.12

Use ByteArray literals

=============== Diff against TrueType-nice.12 ===============

Item was changed:
  ----- Method: TTFileDescription class>>fontOffsetsInFile: (in category 'instance creation') -----
  fontOffsetsInFile: file
  "Answer a collection of font offsets in the given file"
  | tag version nFonts |
  file position: 0.
  tag := file next: 4.
  tag caseOf:{
  ['true' asByteArray] -> ["Version 1.0 TTF file"
  "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
  The values 'true' (0x74727565) and 0x00010000 are recognized by the Mac OS
  as referring to TrueType fonts."
  ^Array with: 0 "only one font"
  ].
+ [#[0 1 0 0]] -> ["Version 1.0 TTF file"
- [#(0 1 0 0) asByteArray] -> ["Version 1.0 TTF file"
  ^Array with: 0 "only one font"
  ].
  ['ttcf' asByteArray] -> ["TTC file"
  version := file next: 4.
  version = #(0 1 0 0) asByteArray ifFalse:[^self error: 'Unsupported TTC version'].
  nFonts := file nextNumber: 4.
  ^(1 to: nFonts) collect:[:i| file nextNumber: 4].
  ].
  } otherwise:[
  self error: 'This is not a valid Truetype file'.
  ].!
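
As a rough illustration of the dispatch in the diff above, the following workspace sketch classifies a 4-byte tag using ByteArray literals only. The symbols #appleTrueType, #trueTypeV1, #collection and #unknown are made up for this example, and the tag is hard-coded rather than read from a file.

```smalltalk
| tag kind |
"Stand-in for 'file next: 4'; a real caller would read these bytes from the font file."
tag := #[0 1 0 0].
kind := tag caseOf: {
	[#[116 114 117 101]] -> [#appleTrueType].	"ASCII codes of 'true'"
	[#[0 1 0 0]] -> [#trueTypeV1].			"16r00010000"
	[#[116 116 99 102]] -> [#collection]		"ASCII codes of 'ttcf'"
} otherwise: [#unknown].
kind
```

Because each key is a literal, no conversion message is sent at run time; the comparison itself is still an ordinary ByteArray equality test.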



Re: The Trunk: TrueType-nice.13.mcz

Igor Stasenko
2010/1/18  <[hidden email]>:

> Nicolas Cellier uploaded a new version of TrueType to project The Trunk:
> http://source.squeak.org/trunk/TrueType-nice.13.mcz
>
> ==================== Summary ====================
>
> Name: TrueType-nice.13
> Author: nice
> Time: 18 January 2010, 6:18:47.592 pm
> UUID: 9c0ae3bf-8eee-434b-bb69-a4cf0bda202a
> Ancestors: TrueType-nice.12
>
> Use ByteArray literals
>

Woohooo, ByteArray literals!!!


--
Best regards,
Igor Stasenko AKA sig.


Re: The Trunk: TrueType-nice.13.mcz

Nicolas Cellier
2010/1/18 Igor Stasenko <[hidden email]>:

> 2010/1/18  <[hidden email]>:
>> Nicolas Cellier uploaded a new version of TrueType to project The Trunk:
>> http://source.squeak.org/trunk/TrueType-nice.13.mcz
>>
>> ==================== Summary ====================
>>
>> Name: TrueType-nice.13
>> Author: nice
>> Time: 18 January 2010, 6:18:47.592 pm
>> UUID: 9c0ae3bf-8eee-434b-bb69-a4cf0bda202a
>> Ancestors: TrueType-nice.12
>>
>> Use ByteArray literals
>>
>
> Woohooo, ByteArray literals!!!
>

BTW, I see much code like:
   'true' asByteArray
Not that it costs that much CPU, but I'd like to make it a literal...
However, I'd like to preserve the initializing code and semantics.
This solution is not really satisfying:
  "'true' asByteArray" #[116 114 117 101]

Dolphinners would write some initializer like:
##( 'true' asByteArray )

Travis Griggs would use a block and a tricky superpower in VW (can't
find the blog right now).

How do you see it?

Nicolas

>
> --
> Best regards,
> Igor Stasenko AKA sig.
>
>


Re: The Trunk: TrueType-nice.13.mcz

Igor Stasenko
2010/1/18 Nicolas Cellier <[hidden email]>:

> 2010/1/18 Igor Stasenko <[hidden email]>:
>> 2010/1/18  <[hidden email]>:
>>> Nicolas Cellier uploaded a new version of TrueType to project The Trunk:
>>> http://source.squeak.org/trunk/TrueType-nice.13.mcz
>>>
>>> ==================== Summary ====================
>>>
>>> Name: TrueType-nice.13
>>> Author: nice
>>> Time: 18 January 2010, 6:18:47.592 pm
>>> UUID: 9c0ae3bf-8eee-434b-bb69-a4cf0bda202a
>>> Ancestors: TrueType-nice.12
>>>
>>> Use ByteArray literals
>>>
>>
>> Woohooo, ByteArray literals!!!
>>
>
> BTW, I see much code like:
>   'true' asByteArray
> Not that it costs that much CPU, but I'd like to make it a literal...
> However, I'd like to preserve the initializing code and semantics.
> This solution is not really satisfying:
>  "'true' asByteArray" #[116 114 117 101]
>
> Dolphinners would write some initializer like:
> ##( 'true' asByteArray )
>
> Travis Griggs would use a block and a tricky superpower in VW (can't
> find the blog right now).
>
> How do you see it?
>
Perhaps:
#['true']

with a check: if the first literal is a string (not a number), then it
must be the only literal within []

> Nicolas
>
>>
>> --
>> Best regards,
>> Igor Stasenko AKA sig.
>>
>>
>
>



--
Best regards,
Igor Stasenko AKA sig.


Re: The Trunk: TrueType-nice.13.mcz

Michael Haupt-3
Hi,

Am 18.01.2010 um 19:47 schrieb Igor Stasenko <[hidden email]>:

>> Dolphinners would write some initializer like:
>> ##( 'true' asByteArray )
>>
>> ...
>> How do you see it?
>>
> Perhaps:
> #['true']
>
> the check that if first literal is string (not a number), then it
> should be the only literal within []

I don't know Dolphin, but the construct looks like a macro (correct?).  
Now, macros are way cool. At least in my opinion. Who says we have to  
stick to ST-80 forever?

Actually, I'd expect there to be a macro package somewhere out there.  
Does anyone know?

Best,

Michael


Re: The Trunk: TrueType-nice.13.mcz

Igor Stasenko
2010/1/18 Michael Haupt <[hidden email]>:

> Hi,
>
> Am 18.01.2010 um 19:47 schrieb Igor Stasenko <[hidden email]>:
>>>
>>> Dolphinners would write some initializer like:
>>> ##( 'true' asByteArray )
>>>
>>> ...
>>> How do you see it?
>>>
>> Perhaps:
>> #['true']
>>
>> the check that if first literal is string (not a number), then it
>> should be the only literal within []
>
> I don't know Dolphin, but the construct looks like a macro (correct?). Now,
> macros are way cool. At least in my opinion. Who says we have to stick to
> ST-80 forever?
>
Yeah, just don't forget about decompiler.

> Actually, I'd expect there to be a macro package somewhere out there. Does
> anyone know?
>

> Best,
>
> Michael
>
>



--
Best regards,
Igor Stasenko AKA sig.


Re: The Trunk: TrueType-nice.13.mcz

Michael Haupt-3
Hi Igor,

Am 18.01.2010 um 22:19 schrieb Igor Stasenko <[hidden email]>:
> Yeah, just don't forget about decompiler.

sure. How does Dolphin deal with this?

Best,

Michael

Re: The Trunk: TrueType-nice.13.mcz

Colin Putney
In reply to this post by Nicolas Cellier

On 2010-01-18, at 10:34 AM, Nicolas Cellier wrote:

>
> BTW, I see much code like:
>   'true' asByteArray
> Not that it costs that much CPU, but I'd like to make it a literal...
> However, I'd like to preserve the initializing code and semantics.
> This solution is not really satisfying:
>  "'true' asByteArray" #[116 114 117 101]

I prefer this option, actually. 'true' asByteArray is bogus to begin with. What encoding are those bytes? ASCII? Latin-1? UTF8? Do the bytes that are produced depend on the internal encoding of String? Or maybe the encoding of the changes file. Sure, all these questions have answers, but they are not obvious from the source code. At least #[116 114 117 101] is unambiguous.

Colin

Re: The Trunk: TrueType-nice.13.mcz

Nicolas Cellier
2010/1/19 Colin Putney <[hidden email]>:

>
> On 2010-01-18, at 10:34 AM, Nicolas Cellier wrote:
>
>>
>> BTW, I see much code like:
>>   'true' asByteArray
>> Not that it costs that much CPU, but I'd like to make it a literal...
>> However, I'd like to preserve the initializing code and semantics.
>> This solution is not really satisfying:
>>  "'true' asByteArray" #[116 114 117 101]
>
> I prefer this option, actually. 'true' asByteArray is bogus to begin with. What encoding are those bytes? ASCII? Latin-1? UTF8? Do the bytes that are produced depend on the internal encoding of String? Or maybe the encoding of the changes file. Sure, all these questions have answers, but they are not obvious from the source code. At least #[116 114 117 101] is unambiguous.
>
> Colin
>

Seems to be a header for true type fonts


Re: The Trunk: TrueType-nice.13.mcz

Andreas.Raab
In reply to this post by Colin Putney
Colin Putney wrote:

> On 2010-01-18, at 10:34 AM, Nicolas Cellier wrote:
>
>> BTW, I see much code like:
>>   'true' asByteArray
>> Not that it costs that much CPU, but I'd like to make it a literal...
>> However, I'd like to preserve the initializing code and semantics.
>> This solution is not really satisfying:
>>  "'true' asByteArray" #[116 114 117 101]
>
> I prefer this option, actually. 'true' asByteArray is bogus to begin with.

It's not. Check the comment:

        ['true' asByteArray] -> ["Version 1.0 TTF file"
                "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
                The values 'true' (0x74727565) and 0x00010000 are recognized by the Mac OS
                as referring to TrueType fonts."
                ^Array with: 0 "only one font"
        ].

Since this code has no performance implication whatsoever, I find 'true'
asByteArray vastly more readable and intention revealing than  #[116 114
117 101].

Cheers,
   - Andreas



Re: Re: The Trunk: TrueType-nice.13.mcz

Colin Putney

On 2010-01-18, at 8:23 PM, Andreas Raab wrote:

> It's not. Check the comment:
>
> ['true' asByteArray] -> ["Version 1.0 TTF file"
> "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
> The values 'true' (0x74727565) and 0x00010000 are recognized by the Mac OS
> as referring to TrueType fonts."
> ^Array with: 0 "only one font"
> ].
>
> Since this code has no performance implication whatsoever, I find 'true' asByteArray vastly more readable and intention revealing than  #[116 114 117 101].

It's more intention revealing, yes. What I don't like is that it relies on the unspecified semantics of #asByteArray. If the implementation of String changes (and it's happened once in recent history, so it's not just a "theoretical" worry), then you may not get the bytes you were expecting.

The code above specifies the intention, but not the actual bytes required. The exact byte sequence is important though, enough so that the author felt the need to specify it in the comment. I'd rather see the exact bytes in the code, and the intention in the comment.
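
A minimal sketch of that preference, assuming a Squeak image in which 'true' is an ordinary ByteString; the variable name appleTag is illustrative only:

```smalltalk
| appleTag |
"Exact bytes in the code, intention in the comment:
 these are the ASCII codes of 'true' (16r74 16r72 16r75 16r65)."
appleTag := #[116 114 117 101].

"Cross-checks against the intention-revealing spelling."
appleTag = 'true' asByteArray.
appleTag asString = 'true'.
```

Both comparisons are expected to answer true as long as the string holds only ASCII characters, which is exactly the assumption under discussion.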

Colin

Re: Re: The Trunk: TrueType-nice.13.mcz

Eliot Miranda-2


On Mon, Jan 18, 2010 at 8:59 PM, Colin Putney <[hidden email]> wrote:

On 2010-01-18, at 8:23 PM, Andreas Raab wrote:

> It's not. Check the comment:
>
>       ['true' asByteArray] -> ["Version 1.0 TTF file"
>               "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
>               The values 'true' (0x74727565) and 0x00010000 are recognized by the Mac OS
>               as referring to TrueType fonts."
>               ^Array with: 0 "only one font"
>       ].
>
> Since this code has no performance implication whatsoever, I find 'true' asByteArray vastly more readable and intention revealing than  #[116 114 117 101].

It's more intention revealing, yes. What I don't like is that it relies on the unspecified semantics of #asByteArray. If the implementation of String changes (and it's happened once in recent history, so it's not just a "theoretical" worry), then you may not get the bytes you were expecting.

The code above specifies the intention, but not the actual bytes required. The exact byte sequence is important though, enough so that the author felt the need to specify it in the comment. I'd rather see the exact bytes in the code, and the intention in the comment.

Colin

I agree, and hence why not:

       [#[ 16r74 "t" 16r72 "r" 16r75 "u" 16r65 "e" ]] -> ["Version 1.0 TTF file"
               "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
               The values 'true' (16r74727565 in ascii) and 16r00010000 are recognized by the Mac OS
               as referring to TrueType fonts."
               ^Array with: 0 "only one font"
       ].

?  Use of 16r is to be preferred to 0x because this is a comment to be read by Smalltalk programmers.



Re: The Trunk: TrueType-nice.13.mcz

Andreas.Raab
Eliot Miranda wrote:

> I agree, and hence why not:
>
>        [#[ 16r74 "t" 16r72 "r" 16r75 "u" 16r65 "e" ] -> ["Version 1.0
> TTF file"
>              
>  "http://developer.apple.com/textfonts/TTRefMan/RM06/Chap6.html
>                The values 'true' (16r74727565 in ascii) and 16r00010000
> are recognized by the Mac OS
>                as referring to TrueType fonts."
>                ^Array with: 0 "only one font"
>        ].
>
> ?  Use of 16r is to be preferred to 0x because this is a comment to be
> read by Smalltalk programmers.

Using #[ 16r74 "t" 16r72 "r" 16r75 "u" 16r65 "e" ] is even worse than
the alternatives because now you can't even see if what you wrote is
what's expected to be there. Shudder. You *really* want to read and
debug code that looks like that? I most definitely don't.

Regarding the "oh, this would be a problem if the string encoding
changes", let's keep in mind that the string encoding *hasn't* changed
for ASCII. So, no, it's specifically incorrect to say that the past
change would have affected or invalidated that code. I don't think you
understand how fundamental the impact of an encoding change is - if you
did that *all* string literals would look like gobbly gook. The only
reason we could do it for non-ascii was that we're not using any
non-ascii characters and we were only switching the characters > 127
from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your
favorite non-ascii piece of text and throw it at "yourText squeakToMac"
and have a look at it.

So the argument that the dependency on literal string encoding is an
issue is bogus. If you change literal string encoding there are so many
other places that break it's not even funny. And at least I don't design
for the implausible (what reason do we have to expect the encoding to
change in the next fifty years?)

Cheers,
   - Andreas


Re: The Trunk: TrueType-nice.13.mcz

Colin Putney

On 2010-01-19, at 10:04 AM, Andreas Raab wrote:

> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
>
> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)


Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)

As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.

BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.

This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.

Colin

Re: The Trunk: TrueType-nice.13.mcz

johnmci

On 2010-01-20, at 6:55 PM, Colin Putney wrote:

>
> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.
>
> Colin


If I recall correctly a conversation I had with David Simmons of S# and QKS fame (http://www.linkedin.com/pub/david-simmons/4/474/789)
back in the early part of the last decade, he said that String objects were Unicode-based and you didn't really mess with them or think of them as byte arrays:
a slot wasn't a byte, it was an object.

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter:  squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================






Re: The Trunk: TrueType-nice.13.mcz

Nicolas Cellier
In reply to this post by Colin Putney
2010/1/21 Colin Putney <[hidden email]>:

>
> On 2010-01-19, at 10:04 AM, Andreas Raab wrote:
>
>> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
>>
>> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)
>
>
> Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
>
> As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.
>
> BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.
>
> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.
>
> Colin
>

I'm not convinced.
It's like saying that the fact that a true-type file begins with the
de-facto universal encoding of 'true' is purely coincidental...
IMO, it's the same expression of 'true' asByteArray and never was a
series of bits 16r74727565 in the mind of whoever conceived the
pattern.

Nicolas


Re: The Trunk: TrueType-nice.13.mcz

Josh Gargus
In reply to this post by Colin Putney
Here are a few ideas.  The first is to extend the new literal ByteArray syntax to allow #[true].  Or maybe #['true'], so that we can mix characters and integers: #['all' 4 1 'and' 1 4 'all'].

(that one was supposed to be a joke... more serious thoughts are inline below)


On Jan 20, 2010, at 6:55 PM, Colin Putney wrote:

>
> On 2010-01-19, at 10:04 AM, Andreas Raab wrote:
>
>> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
>>
>> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)
>
>
> Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
>
> As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.
>
> BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.
>
> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.


I don't think that sending #asByteArray to a String makes an assumption about its internal structure.  Why is it any different than sending #asString to a Character?  Both *request* a transformation of the receiver that is convenient in a wide range of problem domains.  Both of these are far more like each other than passing a String argument to a primitive that treats it as bytes.

The reality is that C-isms such as 'true' are everywhere (e.g., QuickTime uses hundreds of them), so why decree obfuscation rather than allowing natural use of existing idioms?

Maybe a middle ground would be to remove #asByteArray from String, but leave it in ByteString?  If you have any doubt that the receiver is a ByteString, you can wrap it in an error-handler, or simply not send #asByteArray to it.  Is this any better, in your eyes?

Cheers,
Josh


> Colin



Re: The Trunk: TrueType-nice.13.mcz

Josh Gargus
In reply to this post by Nicolas Cellier

On Jan 20, 2010, at 11:02 PM, Nicolas Cellier wrote:

> 2010/1/21 Colin Putney <[hidden email]>:
>>
>> On 2010-01-19, at 10:04 AM, Andreas Raab wrote:
>>
>>> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
>>>
>>> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)
>>
>>
>> Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
>>
>> As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.
>>
>> BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.
>>
>> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.
>>
>> Colin
>>
>
> I'm not convinced.
> It's like saying that the fact that a true-type file begins with the
> de-facto universal encoding of 'true' is purely coincidental...
> IMO, it's the same expression of 'true' asByteArray and never was a
> series of bits 16r74727565 in the mind of whoever conceived the
> pattern.

+1

Josh


>
> Nicolas
>



String encodings (was Re: The Trunk: TrueType-nice.13.mcz)

Bert Freudenberg
In reply to this post by Nicolas Cellier
On 21.01.2010, at 08:02, Nicolas Cellier wrote:

>
> 2010/1/21 Colin Putney <[hidden email]>:
>>
>> On 2010-01-19, at 10:04 AM, Andreas Raab wrote:
>>
>>> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
>>>
>>> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)
>>
>>
>> Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
>>
>> As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.
>>
>> BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.
>>
>> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.
>>
>> Colin
>>
>
> I'm not convinced.
> It's like saying that the fact that a true-type file begins with the
> de-facto universal encoding of 'true' is purely coincidental...
> IMO, it's the same expression of 'true' asByteArray and never was a
> series of bits 16r74727565 in the mind of whoever conceived the
> pattern.
>
> Nicolas

IMHO in the TrueType case, 'true' asByteArray is just fine, because the behavior of #asByteArray is well-specified if the receiver happens to be an ASCII string. And readability counts.

With that out of the way I have to side with Colin. Here is the problem in a nutshell:

#(127 128 256) collect: [:c | c -> {
        c asCharacter asString asByteArray.
        c asCharacter asString squeakToUtf8 asByteArray}]

==> { 127->#( #[127]     #[127] ) .
        128->#( #[128]     #[194 128] ) .
        256->#( #[0 0 1 0] #[196 128] ) }

In general, #asByteArray makes no practical sense unless you know the encoding of the string. The only kind of String knowing its encoding is WideString. I don't know a solution that's not painful, but being more explicit about encodings seems to be a Good Idea.

- Bert -
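
A minimal workspace sketch of the distinction drawn above, assuming a Squeak image where ByteString holds Latin-1 characters and #squeakToUtf8 is available (both appear in this thread). The choice of character 233 (é) is only an illustration and does not come from the TrueType code. Evaluate each line with "print it"; the trailing comments show what each expression is expected to answer under those assumptions.

```smalltalk
"ASCII only: the font tag yields the same four bytes on either path,
 which is why 'true' asByteArray is safe in fontOffsetsInFile:."
'true' asByteArray = #[116 114 117 101].                "true"
'true' squeakToUtf8 asByteArray = #[116 114 117 101].   "true"

"The same bytes written in hex, i.e. 16r74727565 split into four."
#[16r74 16r72 16r75 16r65] = 'true' asByteArray.        "true"

"Beyond ASCII the two paths diverge, so the encoding has to be named.
 Character 233 is an assumption chosen for illustration."
(String with: (Character value: 233)) asByteArray = #[233].                  "true"
(String with: (Character value: 233)) squeakToUtf8 asByteArray = #[195 169]. "true"
```

Naming the encoding at the point of conversion, as the last line does, keeps the byte layout out of the caller's assumptions about String.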




Re: [squeak-dev] String encodings (was Re: The Trunk: TrueType-nice.13.mcz)

Josh Gargus

On Jan 21, 2010, at 3:16 AM, Bert Freudenberg wrote:

> On 21.01.2010, at 08:02, Nicolas Cellier wrote:
>>
>> 2010/1/21 Colin Putney <[hidden email]>:
>>>
>>> On 2010-01-19, at 10:04 AM, Andreas Raab wrote:
>>>
>>>> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that, *all* string literals would look like gobbledygook. The only reason we could do it for non-ASCII was that we're not using any non-ASCII characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ASCII piece of text and throw it at "yourText squeakToMac" and have a look at it.
>>>>
>>>> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)
>>>
>>>
>>> Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VW does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
>>>
>>> As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though, it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings etc, but easier.
>>>
>>> BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't manipulate them inside the image.
>>>
>>> This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.
>>>
>>> Colin
>>>
>>
>> I'm not convinced.
>> It's like saying that the fact that a true-type file begins with the
>> de-facto universal encoding of 'true' is purely coincidental...
>> IMO, it's the same expression of 'true' asByteArray and never was a
>> series of bits 16r74727565 in the mind of whoever conceived the
>> pattern.
>>
>> Nicolas
>
> IMHO in the TrueType case, 'true' asByteArray is just fine, because the behavior of #asByteArray is well-specified if the receiver happens to be an ASCII string. And readability counts.
>
> With that out of the way I have to side with Colin. Here is the problem in a nutshell:
>
> #(127 128 256) collect: [:c | c -> {
> c asCharacter asString asByteArray.
> c asCharacter asString squeakToUtf8 asByteArray}]
>
> ==> { 127->#( #[127]     #[127] ) .
> 128->#( #[128]     #[194 128] ) .
> 256->#( #[0 0 1 0] #[196 128] ) }
>


I can see that my previous post (where I advocated only implementing #asByteArray on ByteString) was misinformed... I did not expect "256 asCharacter asString squeakToUtf8 class == ByteString" to be true (goes to show how little I know).  Thanks for the nutshell.


> In general, #asByteArray makes no practical sense unless you know the encoding of the string. The only kind of String knowing its encoding is WideString. I don't know a solution that's not painful, but being more explicit about encodings seems to be a Good Idea.


Agreed.

Cheers,
Josh


>
> - Bert -
>
>
>