The Trunk: Multilingual-tpr.185.mcz

commits-2
tim Rowledge uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-tpr.185.mcz

==================== Summary ====================

Name: Multilingual-tpr.185
Author: tpr
Time: 8 October 2013, 2:50:18.117 pm
UUID: 4417f293-d927-4f27-a55e-140178ab2eee
Ancestors: Multilingual-nice.184

Make the character encoders and language environments understand how to delegate the next step of character scanning

=============== Diff against Multilingual-nice.184 ===============

Item was changed:
  ----- Method: EncodedCharSet class>>charsetAt: (in category 'class methods') -----
  charsetAt: encoding
+ "Find  the char set encoding that matches 'encoding'; return a decent default rather than nil"
+ ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
-
- ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
  !

Item was added:
+ ----- Method: EncodedCharSet class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'accessing - displaying') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "the default for scanning multibyte characters- other more specific encodings may do something else"
+ ^aFont scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: JapaneseEnvironment class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'language methods') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "scanning multibyte Japanese strings"
+ ^aFont scanMultibyteJapaneseCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: LanguageEnvironment class>>scanMultibyteCharactersFrom:to:in:with:rightX:font: (in category 'language methods') -----
+ scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX font: aFont
+ "the default for scanning multibyte characters- other more specific encodings may do something else"
+ ^aFont scanMultibyteCharactersFrom: startIndex to: stopIndex in: aWideString with: aCharacterScanner rightX: rightX!

Item was added:
+ ----- Method: String>>encodedCharSetAt: (in category '*Multilingual') -----
+ encodedCharSetAt: index
+ "return the character encoding in place at index; the actual EncodedCharSet, not just a number. A bad index is an Error"
+ ^EncodedCharSet charsetAt: (self at: index) leadingChar!



Re: The Trunk: Multilingual-tpr.185.mcz

Nicolas Cellier
I would prefer the decent default to be ^Unicode, in case (EncodedCharSets at: 1) is ever nil for some (bad) reason.







Re: The Trunk: Multilingual-tpr.185.mcz

timrowledge

On 08-10-2013, at 2:55 PM, Nicolas Cellier <[hidden email]> wrote:

> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.

That would be fine by me, but I've been trying to change as little as possible at each step, especially when I have no information on why a particular choice was made. If you understand enough to feel it is a good change to make, I say go for it.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Furnulum pani nolo = I don't want a toaster.




Re: The Trunk: Multilingual-tpr.185.mcz

Levente Uzonyi-2
In reply to this post by Nicolas Cellier
On Tue, 8 Oct 2013, Nicolas Cellier wrote:

> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.

Wouldn't it be better to fill the EncodedCharSets array with Unicode
by default in EncodedCharSet class >> #initialize? (replace the line

  EncodedCharSets := Array new: 256.

with:

  EncodedCharSets := Array new: 256 withAll: Unicode
)

That way #charsetAt: could be simply

  ^EncodedCharSets at: encoding + 1


Levente



Re: The Trunk: Multilingual-tpr.185.mcz

Bert Freudenberg

On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:

> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>
>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.
>
> Wouldn't it be better to fill the EncodedCharSets array with Unicode by default in EncodedCharSet class >> #initialize? (replace the line
>
> EncodedCharSets := Array new: 256.
>
> with:
>
> EncodedCharSets := Array new: 256 withAll: Unicode
> )
>
> That way #charsetAt: could be simply
>
> ^EncodedCharSets at: encoding + 1
>
>
> Levente


IMHO that would obscure the intention. It is technically equivalent, yes, but I'd like to see the explicit default. Most readable might be this:

        ^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]

We could even skip the "+ 1" part and only store the encoded charsets in EncodedCharSets. Unicode is not encoded, which is well-expressed by the code 0.

        ^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]


- Bert -

>>        charsetAt: encoding
>>      + "Find  the char set encoding that matches 'encoding'; return a decent default rather than nil"
>>      +       ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
>>      -
>>      -       ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
>>        !





Re: The Trunk: Multilingual-tpr.185.mcz

Hannes Hirzel
On 10/9/13, Bert Freudenberg <[hidden email]> wrote:

>
> On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:
>
>> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>>
>>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets
>>> at:1) isNil for some (bad) reason.
>>
>> Wouldn't it be better to fill the EncodedCharSets array with Unicode by
>> default in EncodedCharSet class >> #initialize? (replace the line
>>
>> EncodedCharSets := Array new: 256.
>>
>> with:
>>
>> EncodedCharSets := Array new: 256 withAll: Unicode
>> )
>>
>> That way #charsetAt: could be simply
>>
>> ^EncodedCharSets at: encoding + 1
>>
>>
>> Levente
>
>
> IMHO that would obscure the intention. It is technically equivalent, yes,
> but I'd like to see the explicit default. Most readable might be this:
>
> ^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]
>
> We could even skip the "+ 1" part and only store the encoded charsets in
> EncodedCharSets. Unicode is not encoded, which is well-expressed by the code
> 0.
>



> ^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]


+1 for this as intention-revealing.  Tells us that Unicode is the default.


> - Bert -
>
>>>        charsetAt: encoding
>>>      + "Find  the char set encoding that matches 'encoding'; return a
>>> decent default rather than nil"
>>>      +       ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets
>>> at: 1].
>>>      -
>>>      -       ^ EncodedCharSets at: encoding + 1 ifAbsent:
>>> [EncodedCharSets at: 1].
>>>        !
>
>
>
>
>


Re: The Trunk: Multilingual-tpr.185.mcz

Levente Uzonyi-2
In reply to this post by Bert Freudenberg
On Wed, 9 Oct 2013, Bert Freudenberg wrote:

>
> On 09.10.2013, at 00:52, Levente Uzonyi <[hidden email]> wrote:
>
>> On Tue, 8 Oct 2013, Nicolas Cellier wrote:
>>
>>> I would prefer decent default being ^Unicode, if ever (EncodedCharSets at:1) isNil for some (bad) reason.
>>
>> Wouldn't it be better to fill the EncodedCharSets array with Unicode by default in EncodedCharSet class >> #initialize? (replace the line
>>
>> EncodedCharSets := Array new: 256.
>>
>> with:
>>
>> EncodedCharSets := Array new: 256 withAll: Unicode
>> )
>>
>> That way #charsetAt: could be simply
>>
>> ^EncodedCharSets at: encoding + 1
>>
>>
>> Levente
>
>
> IMHO that would obscure the intention. It is technically equivalent, yes, but I'd like to see the explicit default. Most readable might be this:

I think it's better, because the intention is expressed in a single
method, instead of two. The explicit default is there, but in #initialize.

>
> ^ (EncodedCharSets at: encoding + 1) ifNil: [Unicode]
>
> We could even skip the "+ 1" part and only store the encoded charsets in EncodedCharSets. Unicode is not encoded, which is well-expressed by the code 0.
>
> ^ (EncodedCharSets at: encoding ifAbsent: [nil]) ifNil: [Unicode]

Performance-wise it's better to keep the "+ 1", and even better to save
the #ifNil: too. :)


Levente

>
>
> - Bert -
>
>>>        charsetAt: encoding
>>>      + "Find  the char set encoding that matches 'encoding'; return a decent default rather than nil"
>>>      +       ^ (EncodedCharSets at: encoding + 1) ifNil: [EncodedCharSets at: 1].
>>>      -
>>>      -       ^ EncodedCharSets at: encoding + 1 ifAbsent: [EncodedCharSets at: 1].
>>>        !
>
>
>
>
>


Re: The Trunk: Multilingual-tpr.185.mcz

Nicolas Cellier
I don't know whether this micro-benchmark is relevant, since charsetAt: should only need to be sent when the leadingChar changes (the send should be moved out of the scanJapaneseCharactersFrom: loop).
I should also run it more than once, but here it is:

| tmp |
tmp := {Unicode. nil}.
{
[tmp at: 1] bench.
[(tmp at: 1) ifNil: [Unicode]] bench.
[(tmp at: 2) ifNil: [Unicode]] bench.
[tmp at: 1 ifAbsent: [Unicode]] bench.
[tmp at: 0 ifAbsent: [Unicode]] bench.
[(tmp at: 0 ifAbsent: [nil]) ifNil: [Unicode]] bench.
[(tmp at: 0 ifAbsent: nil) ifNil: [Unicode]] bench.
}
 #(
'22,900,000 per second.'
'22,700,000 per second.'
'18,500,000 per second.'
'5,570,000 per second.'
'5,200,000 per second.'
'5,160,000 per second.'
'14,600,000 per second.'
)

The major cost of at:ifAbsent: currently seems to be the closure...
Cheating by passing nil instead of a block (nil value answers nil) makes a difference.

Shall we make provisions for leadingChar > 256 in the upcoming 64-bit Spur image, or will immediate characters be restricted to 32 bits?
Note that leadingChar could already reach 1023 (10 bits), because there is no reason to restrict the contents of a WordArray (32 bits) to small positive integers (30 bits), other than a convention to avoid slowing things down too much with LargeIntegers...
The ifAbsent: protects us from such a crafted MalCharacter.
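
For readers following the thread, here is a workspace sketch of how the lookup variants under discussion differ. It is an illustration only and not part of the commit: the symbols #Unicode and #SomeCharset stand in for the real charset classes, the local table stands in for EncodedCharSets, and encoding 5 is a made-up registration.

```smalltalk
"Workspace sketch; every name below is a stand-in chosen for illustration."
| table original committed combined |
table := Array new: 256.
table at: 1 put: #Unicode.        "encoding 0 lives in slot 1"
table at: 6 put: #SomeCharset.    "hypothetical charset registered for encoding 5"

"Before the commit: guards a bad index, but an empty slot still answers nil."
original := [:encoding | table at: encoding + 1 ifAbsent: [table at: 1]].

"After the commit: guards an empty slot, but a bad index raises an error."
committed := [:encoding | (table at: encoding + 1) ifNil: [table at: 1]].

"Both guards, with the explicit Unicode default suggested in the replies."
combined := [:encoding | (table at: encoding + 1 ifAbsent: [nil]) ifNil: [#Unicode]].

original value: 7.       "nil, since slot 8 is empty"
original value: 1000.    "#Unicode, since the index is out of range"
committed value: 7.      "#Unicode"
combined value: 7.       "#Unicode"
combined value: 1000.    "#Unicode"
[committed value: 1000] on: Error do: [:e | e class].   "answers the class of the bounds error"
```

The combined form pays for both the at:ifAbsent: closure and the ifNil: test, which is the cost the benchmark above measures; pre-filling the table with Unicode, as proposed earlier in the thread, removes the need for the ifNil: but still leaves out-of-range indices to be handled.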

