The Trunk: Multilingual-tpr.185.mcz

commits-2

The Trunk: Multilingual-tpr.185.mcz

tim Rowledge uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-tpr.185.mcz

==================== Summary ====================

Name: Multilingual-tpr.185
Author: tpr
Time: 8 October 2013, 2:50:18.117 pm
UUID: 4417f293-d927-4f27-a55e-140178ab2eee
Ancestors: Multilingual-nice.184

Make the character encoders and language environments understand how to delegate the next step of character scanning

=============== Diff against Multilingual-nice.184 ===============

Item was changed:
----- Method: EncodedCharSet class>>charsetAt: (in category 'class methods') -----
charsetAt: encoding
+ "Find the char set encoding that matches 'encoding'; return a decent default rather than nil"
+ ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
-
- ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
!

Item was added:
+ ----- Method: EncodedCharSet class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'accessing - displaying') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "the default for scanning multibyte characters- other more specific encodings may do something else"
+ ^aFont scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: JapaneseEnvironment class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'language methods') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "scanning multibyte Japanese strings"
+ ^aFont scanMultibyteJapaneseCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: LanguageEnvironment class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'language methods') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "the default for scanning multibyte characters- other more specific encodings may do something else"
+ ^aFont scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: String>>encodedCharSetAt: (in category '*Multilingual') -----
+ encodedCharSetAt: index
+ "return the character encoding in place at index; the actual EncodedCharSet, not just a number. A bad index is an Error"
+ ^EncodedCharSet charsetAt: (self at: index) leadingChar!

Nicolas Cellier

Re: The Trunk: Multilingual-tpr.185.mcz

I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.

timrowledge

Re: The Trunk: Multilingual-tpr.185.mcz

On 08-10-2013, at 2:55 PM, Nicolas Cellier <[hidden email]> wrote:

> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.

That would be fine by me, but I've been trying to change as little as possible at each step, especially when I have no information on why a particular choice was made. If you understand enough to feel it is a good change to make, I say go for it.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Furnulum pani nolo = I don't want a toaster.

Levente Uzonyi-2

Re: The Trunk: Multilingual-tpr.185.mcz

In reply to this post by Nicolas Cellier

On Tue, 8 Oct 2013, Nicolas Cellier wrote:

> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.

Wouldn't it be better to fill the EncodedCharSets array with Unicode
by default in EncodedCharSet class >> #initialize? (replace the line

EncodedCharSets := Array new: 256.

with:

EncodedCharSets := Array new: 256 withAll: Unicode
)

That way #charsetAt: could be simply

^EncodedCharSets at: encoding + 1
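
A minimal workspace sketch of this suggestion, under stated assumptions: `table` is a local stand-in for EncodedCharSets, and the symbols #Unicode and #JISX0208 stand in for the charset classes so the snippet does not touch the real class-side state. The slot chosen for #JISX0208 is illustrative only.

```smalltalk
| table |
"Stand-in for EncodedCharSets: every slot starts out holding the default."
table := Array new: 256 withAll: #Unicode.
"Register one hypothetical encoding under leadingChar 1, i.e. slot 1 + 1."
table at: 1 + 1 put: #JISX0208.
"Lookup needs no nil check: unregistered slots already hold the default."
{ table at: 0 + 1. table at: 1 + 1. table at: 7 + 1 }
```

Evaluating the last line gives an Array holding #Unicode, #JISX0208 and #Unicode. An index beyond 256 would still raise an error, since nothing here guards the bounds.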

Levente

Bert Freudenberg

Re: The Trunk: Multilingual-tpr.185.mcz

On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:

> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>
>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.
>
> Wouldn't it be better to fill the EncodedCharSets array with Unicode by default in EncodedCharSet class >> #initialize? (replace the line
>
> EncodedCharSets := Array new: 256.
>
> with:
>
> EncodedCharSets := Array new: 256 withAll: Unicode
> )
>
> That way #charsetAt: could be simply
>
> ^EncodedCharSets at: encoding + 1
>
>
> Levente

IMHO that would obscure the intention. It is technically equivalent, yes, but I'd like to see the explicit default. Most readable might be this:

^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]

We could even skip the "+ 1" part and only store the encoded charsets in EncodedCharSets. Unicode is not encoded, which is well-expressed by the code 0.

^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]
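
The two variants can be compared side by side in a workspace. This is only a sketch: `oneBased` and `sparse` are hypothetical local Arrays standing in for EncodedCharSets, and symbols stand in for the charset classes.

```smalltalk
| oneBased sparse lookupA lookupB |
"Variant 1: slot (encoding + 1); the default is supplied when the slot is nil."
oneBased := Array new: 256.
oneBased at: 1 + 1 put: #JISX0208.
lookupA := [:encoding | (oneBased at: encoding + 1) ifNil: [#Unicode]].
"Variant 2: only encoded charsets are stored, so code 0 falls outside the Array."
sparse := Array new: 255.
sparse at: 1 put: #JISX0208.
lookupB := [:encoding | (sparse at: encoding ifAbsent: [nil]) ifNil: [#Unicode]].
{ lookupA value: 0. lookupA value: 1. lookupB value: 0. lookupB value: 1 }
```

Both lookups answer #Unicode for code 0 and #JISX0208 for code 1. The second also tolerates any out-of-range code, because at:ifAbsent: does the bounds check.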

- Bert -

>> charsetAt: encoding
>> + "Find the char set encoding that matches 'encoding'; return a decent default rather than nil"
>> + ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
>> -
>> - ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
>> !

Hannes Hirzel

Re: The Trunk: Multilingual-tpr.185.mcz

On 10/9/13, Bert Freudenberg <[hidden email]> wrote:

>
> On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:
>
>> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>>
>>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets
>>> at:1) isNil for some (bad) reason.
>>
>> Wouldn't it be better to fill the EncodedCharSets array with Unicode by
>> default in EncodedCharSet class >> #initialize? (replace the line
>>
>> EncodedCharSets := Array new: 256.
>>
>> with:
>>
>> EncodedCharSets := Array new: 256 withAll: Unicode
>> )
>>
>> That way #charsetAt: could be simply
>>
>> ^EncodedCharSets at: encoding + 1
>>
>>
>> Levente
>
>
> IMHO that would obscure the intention. It is technically equivalent, yes,
> but I'd like to see the explicit default. Most readable might be this:
>
> ^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]
>
> We could even skip the "+ 1" part and only store the encoded charsets in
> EncodedCharSets. Unicode is not encoded, which is well-expressed by the code
> 0.
>

> ^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]

+1 for this as intention-revealing. Tells us that Unicode is the default.

> - Bert -
>
>>> charsetAt: encoding
>>> + "Find the char set encoding that matches 'encoding'; return a
>>> decent default rather than nil"
>>> + ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets
>>> at: 1].
>>> -
>>> - ^ EncodedCharSets at: encoding + 1 ifAbsent:
>>> [EncodedCharSets at: 1].
>>> !

Levente Uzonyi-2

Re: The Trunk: Multilingual-tpr.185.mcz

In reply to this post by Bert Freudenberg

On Wed, 9 Oct 2013, Bert Freudenberg wrote:

>
> On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:
>
>> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>>
>>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.
>>
>> Wouldn't it be better to fill the EncodedCharSets array with Unicode by default in EncodedCharSet class >> #initialize? (replace the line
>>
>> EncodedCharSets := Array new: 256.
>>
>> with:
>>
>> EncodedCharSets := Array new: 256 withAll: Unicode
>> )
>>
>> That way #charsetAt: could be simply
>>
>> ^EncodedCharSets at: encoding + 1
>>
>>
>> Levente
>
>
> IMHO that would obscure the intention. It is technically equivalent, yes, but I'd like to see the explicit default. Most readable might be this:

I think it's better, because the intention is expressed in a single
method, instead of two. The explicit default is there, but in #initialize.

>
> ^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]
>
> We could even skip the "+ 1" part and only store the encoded charsets in EncodedCharSets. Unicode is not encoded, which is well-expressed by the code 0.
>
> ^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]

Performance-wise it's better to keep the "+ 1", and even better to save
the #ifNil: too. :)

Levente

>
>
> - Bert -
>
>>> charsetAt: encoding
>>> + "Find the char set encoding that matches 'encoding'; return a decent default rather than nil"
>>> + ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
>>> -
>>> - ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
>>> !

Nicolas Cellier

Re: The Trunk: Multilingual-tpr.185.mcz

I don't know if this micro-benchmark is relevant, since charsetAt: should only need to be sent when the leadingChar changes (the send should be moved out of the scanJapaneseCharactersFrom: loop).

I should also run it more than once, but here it is:

| tmp |
tmp := {Unicode. nil}.
{
    [tmp at: 1] bench.                                      "22,900,000 per second."
    [(tmp at: 1) ifNil: [Unicode]] bench.                   "22,700,000 per second."
    [(tmp at: 2) ifNil: [Unicode]] bench.                   "18,500,000 per second."
    [tmp at: 1 ifAbsent: [Unicode]] bench.                  "5,570,000 per second."
    [tmp at: 0 ifAbsent: [Unicode]] bench.                  "5,200,000 per second."
    [(tmp at: 0 ifAbsent: [nil]) ifNil: [Unicode]] bench.   "5,160,000 per second."
    [(tmp at: 0 ifAbsent: nil) ifNil: [Unicode]] bench.     "14,600,000 per second."
}

The major cost of at:ifAbsent: currently seems to be the Closure...
Cheating by passing nil instead of a block (relying on the property that nil value -> nil) makes a difference.
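
A small workspace check of that property, assuming a Squeak image where Object understands #value and answers itself (`tmp` is rebuilt here with a symbol in place of the Unicode class):

```smalltalk
| tmp |
tmp := Array with: #Unicode with: nil.
"at:ifAbsent: sends #value to its second argument when the index is out of
 bounds, so nil can stand in for the block [nil] without creating a closure."
{ nil value. tmp at: 0 ifAbsent: nil. tmp at: 0 ifAbsent: [nil] }
```

All three elements are nil, which is why the last benchmark line can drop the block.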

Shall we make provisions for leadingChar > 256 in the next 64-bit Spur image, or will immediate characters be restricted to 32 bits?

Note that leadingChar could already reach 1023 (10 bits), because there is no reason to restrict a WordArray content (32 bits) to small positive integers (30 bits), except a convention for not slowing down things too much with LargeIntegers...

The ifAbsent: is protecting us from such a crafted MalCharacter.
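
To make the difference concrete, here is a sketch contrasting the two shapes of charsetAt: from the diff. The tiny `table` Array and the symbols are stand-ins, not the real EncodedCharSets; `before` and `after` are hypothetical names for the pre-commit and post-commit expressions.

```smalltalk
| table before after |
"Deliberately tiny stand-in for EncodedCharSets; only slot 1 is filled."
table := Array new: 4.
table at: 1 put: #Unicode.
"Pre-commit shape: guards against a bad index, but passes a nil slot through."
before := [:encoding | table at: encoding + 1 ifAbsent: [table at: 1]].
"Post-commit shape: replaces a nil slot, but a bad index raises an error."
after := [:encoding | (table at: encoding + 1) ifNil: [table at: 1]].
{ before value: 2.
  before value: 99.
  after value: 2.
  [after value: 99] on: Error do: [:e | #outOfBounds] }
```

The result is nil, #Unicode, #Unicode and #outOfBounds: the old form handles an oversized leadingChar but can answer nil, while the new form never answers nil for a valid index but fails on an oversized one.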
