String >> #=

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

String >> #=

Philippe Marschall-2
Hi

I have been investigating why Dictionary look up performance with String
keys is not as good as I would expected. Something I noted is that
String >> #= is implemented in terms of #compare:with:collated:. There
is no short circuit if Strings are not the same size. In my case some
Strings have the same prefix but a different length eg 'Content-Type'
and 'Content-Length'. In that case a #compare:with:collated: is
performed even though we know in advance the answer will be false
because they have different sizes.

Cheers
Philippe


Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Clément Béra
Hello,

for this use case people typically use Symbols (much faster).

Now you are right we may add a shortcut in string comparison.


2014-05-26 9:51 GMT+02:00 Philippe Marschall <[hidden email]>:
Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Philippe Marschall-2
On 26.05.14 11:08, Clément Bera wrote:
> Hello,
>
> for this use case people typically use Symbols (much faster).

I don't see how that helps because #asSymbol does a look up into a
dictionary with a String key as well.

Cheers
Philippe


Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Eliot Miranda-2
In reply to this post by Philippe Marschall-2
Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:
Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

Why not rewrite

String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2

as

String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."

(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2

?

One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot
Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Philippe Marschall-2
On 27.05.14 06:45, Eliot Miranda wrote:

>
>
>
>
> Hi Phillipe,
>
>
> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall
> <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi
>
>     I have been investigating why Dictionary look up performance with
>     String keys is not as good as I would expected. Something I noted is
>     that String >> #= is implemented in terms of
>     #compare:with:collated:. There is no short circuit if Strings are
>     not the same size. In my case some Strings have the same prefix but
>     a different length eg 'Content-Type' and 'Content-Length'. In that
>     case a #compare:with:collated: is performed even though we know in
>     advance the answer will be false because they have different sizes.
>
>
> Why not rewrite

Yes, that would be my proposal as well.

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Marcus Denker-4

On 27 May 2014, at 09:44, Philippe Marschall <[hidden email]> wrote:

> On 27.05.14 06:45, Eliot Miranda wrote:
>>
>>
>>
>>
>> Hi Phillipe,
>>
>>
>> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall
>> <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>
>>    Hi
>>
>>    I have been investigating why Dictionary look up performance with
>>    String keys is not as good as I would expected. Something I noted is
>>    that String >> #= is implemented in terms of
>>    #compare:with:collated:. There is no short circuit if Strings are
>>    not the same size. In my case some Strings have the same prefix but
>>    a different length eg 'Content-Type' and 'Content-Length'. In that
>>    case a #compare:with:collated: is performed even though we know in
>>    advance the answer will be false because they have different sizes.
>>
>>
>> Why not rewrite
>
> Yes, that would be my proposal as well.
>

Ok, I have added an issue:

        https://pharo.fogbugz.com/f/cases/13277/Speed-up-String



        Marcus


Reply | Threaded
Open this post in threaded view
|

Re: String >> #=

Philippe Marschall-2
In reply to this post by Philippe Marschall-2
On 26.05.14 09:51, Philippe Marschall wrote:

> Hi
>
> I have been investigating why Dictionary look up performance with String
> keys is not as good as I would expected. Something I noted is that
> String >> #= is implemented in terms of #compare:with:collated:. There
> is no short circuit if Strings are not the same size. In my case some
> Strings have the same prefix but a different length eg 'Content-Type'
> and 'Content-Length'. In that case a #compare:with:collated: is
> performed even though we know in advance the answer will be false
> because they have different sizes.

Some follow up on this:
In Seaside we have a custom dictionary GRSmallDictionary which is more
or less an OrderedCollection of Associations (not actually but
conceptually). I did a microbenchmark with 11 keys [1] and out of the
box Dictionary is about twice as fast as GRSmallDictionary. However when
I change GRSmallDictionary to first send and compare #size before
sending #= then GRSmallDictionary is slightly faster than Dictionary.

  [1] #('Content-Type' 'Content-Language' 'Content-Length' 'Date'
'Last-Modified' 'Location' 'Set-Cookie' 'Set-Cookie2' 'Servlet-Engine'
'Status' 'WWW-Authenticate')

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

J. Vuletich (mail lists)
In reply to this post by Eliot Miranda-2

Quoting Eliot Miranda <[hidden email]>:

Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:

Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

 
Why not rewrite
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2
 
as
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
 
(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
 
?
 
One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot

BTW, any good reason for not prefixing all the implementors of #= with this?

"Any object is equal to itself"
self == argument ifTrue: [ ^ true ].

Cheers,
Juan Vuletich
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Eliot Miranda-2



On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote:

Quoting Eliot Miranda <[hidden email]>:

Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:

Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

 
Why not rewrite
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2
 
as
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
 
(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
 
?

This makes a huge difference, over 3 times faster:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(13726 4467)
4467 - 13726 / 137.26 -67.46% 
 
One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot

BTW, any good reason for not prefixing all the implementors of #= with this?

"Any object is equal to itself"
self == argument ifTrue: [ ^ true ].


It doesn't make much difference:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(4628 4560)

4560 - 4628 / 46.28 -1.47%

So is it worth it?  If you feel it is I've no objection other than it feels a little kludgey for such little benefit.  And there are the Symbols if one needs quick comparison and can bear the cost of slow interning.
--
best,
Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Andres Valloud-4
What is going to happen when one compares two general Unicode series of
characters that represent the same string but differ in normalization?
Wouldn't the size test would result in false negatives?

http://unicode.org/reports/tr15/

I'm asking because I haven't seen any discussion on the subject, and the
decision to change the code as proposed could have side effects.

On 5/27/14 11:59 , Eliot Miranda wrote:

>
>
>
> On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists)
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     __
>
>     Quoting Eliot Miranda <[hidden email]
>     <mailto:[hidden email]>>:
>
>>     Hi Phillipe,
>>
>>
>>     On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall
>>     <[hidden email]
>>     <mailto:[hidden email]>> wrote:
>>
>>         Hi
>>
>>         I have been investigating why Dictionary look up performance
>>         with String keys is not as good as I would expected. Something
>>         I noted is that String >> #= is implemented in terms of
>>         #compare:with:collated:. There is no short circuit if Strings
>>         are not the same size. In my case some Strings have the same
>>         prefix but a different length eg 'Content-Type' and
>>         'Content-Length'. In that case a #compare:with:collated: is
>>         performed even though we know in advance the answer will be
>>         false because they have different sizes.
>>
>>     Why not rewrite
>>     String>>= aString
>>     "Answer whether the receiver sorts equally as aString.
>>     The collation order is simple ascii (with case differences)."
>>     aString isString ifFalse: [ ^ false ].
>>     ^ (self compare: self with: aString collated: AsciiOrder) = 2
>>     as
>>     String>>= aString
>>     "Answer whether the receiver sorts equally as aString.
>>     The collation order is simple ascii (with case differences)."
>>     (aString isString
>>     and: [self size = aString size]) ifFalse: [^false].
>>     ^ (self compare: self withSize: with: aString collated:
>>     AsciiOrder) = 2
>>     ?
>
>
> This makes a huge difference, over 3 times faster:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(13726 4467)
> 4467 - 13726 / 137.26 -67.46%
>
>>     One /could/ add a replacement compare:with:collated:
>>     primitive primitiveCompareString which took the sizes as arguments
>>     to avoid asking twice.  But it wouldn't be safe.  One could abuse
>>     the primitive and lie about the size.  So I suspect it is best to
>>     add the size check to String>>#= and accept the duplication of
>>     the primitive finding the sizes of the two strings.  The cost in
>>     the primitive is minimal.  A WideString version of the primitive
>>     might pay its way, but if Spur and Sista arrive soon the primitive
>>     shouldn't be faster than the optimised Smalltalk code.
>>     --
>>     best,
>>     Eliot
>
>     BTW, any good reason for not prefixing all the implementors of #=
>     with this?
>
>     "Any object is equal to itself"
>     self == argument ifTrue: [ ^ true ].
>
>
> It doesn't make much difference:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(4628 4560)
>
> 4560 - 4628 / 46.28 -1.47%
>
> So is it worth it?  If you feel it is I've no objection other than it
> feels a little kludgey for such little benefit.  And there are the
> Symbols if one needs quick comparison and can bear the cost of slow
> interning.
> --
> best,
> Eliot


Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Re: [squeak-dev] Re: String >> #=

Chris Muller-3
In reply to this post by Eliot Miranda-2
> On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote:
>>
>> Quoting Eliot Miranda <[hidden email]>:
>>
>> Hi Phillipe,
>>
>>
>> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:
>>>
>>> Hi
>>>
>>> I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.
>>
>>
>> Why not rewrite
>>
>> String>>= aString
>> "Answer whether the receiver sorts equally as aString.
>> The collation order is simple ascii (with case differences)."
>> aString isString ifFalse: [ ^ false ].
>> ^ (self compare: self with: aString collated: AsciiOrder) = 2
>>
>> as
>>
>> String>>= aString
>> "Answer whether the receiver sorts equally as aString.
>> The collation order is simple ascii (with case differences)."
>>
>> (aString isString
>> and: [self size = aString size]) ifFalse: [^false].
>> ^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
>>
>> ?
>
>
> This makes a huge difference, over 3 times faster:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(13726 4467)
> 4467 - 13726 / 137.26 -67.46%
>>
>>
>> One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
>> --
>> best,
>> Eliot
>>
>> BTW, any good reason for not prefixing all the implementors of #= with this?
>>
>> "Any object is equal to itself"
>> self == argument ifTrue: [ ^ true ].
>
>
> It doesn't make much difference:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(4628 4560)
>
> 4560 - 4628 / 46.28 -1.47%
>
> So is it worth it?  If you feel it is I've no objection other than it feels a little kludgey for such little benefit.

That is a very common convention in the image, not kludgey.
String>>#= should advertise, by its implementation, its desire and
intent for maximum performance.  Not covering this check conveys a
lack of thoroghness and/or dedication to performance.  Future readers
of the method will wonder why such a simple avoidance of unnecessary
processing for that case wasn't taken.  By the time they wonder
whether it was an oversight they've already expended too much thought
about it.

We should really consider Andres' question before popping this into
trunk though..

> And there are the Symbols if one needs quick comparison and can bear the cost of slow interning.



> --
> best,
> Eliot
>

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Andres Valloud-4
In reply to this post by Eliot Miranda-2
String encoding is perpendicular to my point.  I'm referring to
canonical equivalence as defined in section 1.1 of the document
referenced by the URL I sent.  For instance, the Hangul example in the
first table shows that a combination of two characters (regardless of
encoding) is to be considered canonically equivalent to a single
character.  From the document (which claims to be Unicode Standard Annex
#15),

"Canonical equivalence is a fundamental equivalency between characters
or sequences of characters that represent the same abstract character,
and when correctly displayed should always have the same visual
appearance and behavior."

How do you propose that a size check is appropriate in the presence of
canonical equivalence?  What is string equivalence supposed to mean?  I
think more attention should be given to those questions.

On 5/27/14 18:53 , Eliot Miranda wrote:

> Hi Andres,
>
>
> On Tue, May 27, 2014 at 6:10 PM, Andres Valloud
> <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     What is going to happen when one compares two general Unicode series
>     of characters that represent the same string but differ in
>     normalization? Wouldn't the size test would result in false negatives?
>
>     http://unicode.org/reports/__tr15/ <http://unicode.org/reports/tr15/>
>
>     I'm asking because I haven't seen any discussion on the subject, and
>     the decision to change the code as proposed could have side effects.
>
>
> The issue is whether String supports variable-sized encodings such as
> UTF-8 where there is no fixed relationship between the number of bytes i
> a string and the number of characters in a string.  Right now we have
> ByteString and WideString.  ByteString has 1 byte per character.
>   WideString has 4 bytes per character.  So 'hello' asByteString
> contains 5 bytes and has size 5, but 'hello' asWideString contains 20
> bytes and also has size 5.  Hence the size check is fine, since size
> answers the number of characters, not the number of bytes.  If we were
> to add a UTF8String we'd have to delete the size check.  But I think for
> now we're not going to do that.
>
> A ByteString can contain some characters that comprise a UTF-8 string
> (see UTF8TextCoverter) but that's a convention of usage.  if you print
> some ByteString containing the UTF-8 encoding of a string containing
> characters that take more than one byte to encode, that string won't
> print as the input, it'll print treating each byte as a character, and
> so will scramble the string.  It is up to the user to handle these
> ByteStrings that happen by convention to contain UTF-8 correctly.
>
> Note that there is nothing to stop us adding a UTF8String *provided*
> that class implements size to answer the number of characters, not the
> number of bytes.  My understanding is that VW takes this approach also.
> File streams expose the encoding, sicne position is a byte position, not
> a character position, and so it is up to the file stream client to cope
> with the positioning complexities that this introduces, not the stream.
>
>
> OK?
>
>
>
>     On 5/27/14 11:59 , Eliot Miranda wrote:
>
>
>
>
>         On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists)
>         <[hidden email] <mailto:[hidden email]>
>         <mailto:juanlists@jvuletich.__org
>         <mailto:[hidden email]>>> wrote:
>
>              __
>
>              Quoting Eliot Miranda <[hidden email]
>         <mailto:[hidden email]>
>              <mailto:eliot.miranda@gmail.__com
>         <mailto:[hidden email]>>>:
>
>                  Hi Phillipe,
>
>
>                  On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall
>                  <philippe.marschall@netcetera.__ch
>             <mailto:[hidden email]>
>                  <mailto:[hidden email]
>             <mailto:[hidden email]>>> wrote:
>
>                      Hi
>
>                      I have been investigating why Dictionary look up
>             performance
>                      with String keys is not as good as I would
>             expected. Something
>                      I noted is that String >> #= is implemented in terms of
>                      #compare:with:collated:. There is no short circuit
>             if Strings
>                      are not the same size. In my case some Strings have
>             the same
>                      prefix but a different length eg 'Content-Type' and
>                      'Content-Length'. In that case a
>             #compare:with:collated: is
>                      performed even though we know in advance the answer
>             will be
>                      false because they have different sizes.
>
>                  Why not rewrite
>                  String>>= aString
>                  "Answer whether the receiver sorts equally as aString.
>                  The collation order is simple ascii (with case
>             differences)."
>                  aString isString ifFalse: [ ^ false ].
>                  ^ (self compare: self with: aString collated:
>             AsciiOrder) = 2
>                  as
>                  String>>= aString
>                  "Answer whether the receiver sorts equally as aString.
>                  The collation order is simple ascii (with case
>             differences)."
>                  (aString isString
>                  and: [self size = aString size]) ifFalse: [^false].
>                  ^ (self compare: self withSize: with: aString collated:
>                  AsciiOrder) = 2
>                  ?
>
>
>
>         This makes a huge difference, over 3 times faster:
>
>         | bs t1 t2 |
>         bs := ByteString allInstances first: 10000.
>         t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
>         (FileStream fileNamed:
>         '/Users/eliot/Squeak/Squeak4.__5/String-=.st') fileIn.
>         t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
>         { t1. t2 } #(13726 4467)
>         4467 - 13726 / 137.26 -67.46%
>
>                  One /could/ add a replacement compare:with:collated:
>                  primitive primitiveCompareString which took the sizes
>             as arguments
>                  to avoid asking twice.  But it wouldn't be safe.  One
>             could abuse
>                  the primitive and lie about the size.  So I suspect it
>             is best to
>                  add the size check to String>>#= and accept the
>             duplication of
>                  the primitive finding the sizes of the two strings.
>               The cost in
>                  the primitive is minimal.  A WideString version of the
>             primitive
>                  might pay its way, but if Spur and Sista arrive soon
>             the primitive
>                  shouldn't be faster than the optimised Smalltalk code.
>                  --
>                  best,
>                  Eliot
>
>
>              BTW, any good reason for not prefixing all the implementors
>         of #=
>              with this?
>
>              "Any object is equal to itself"
>              self == argument ifTrue: [ ^ true ].
>
>
>         It doesn't make much difference:
>
>         | bs t1 t2 |
>         bs := ByteString allInstances first: 10000.
>         t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
>         (FileStream fileNamed:
>         '/Users/eliot/Squeak/Squeak4.__5/String-=.st') fileIn.
>         t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
>         { t1. t2 } #(4628 4560)
>
>         4560 - 4628 / 46.28 -1.47%
>
>         So is it worth it?  If you feel it is I've no objection other
>         than it
>         feels a little kludgey for such little benefit.  And there are the
>         Symbols if one needs quick comparison and can bear the cost of slow
>         interning.
>         --
>         best,
>         Eliot
>
>
>
>
>
> --
> best,
> Eliot

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Philippe Marschall-2
In reply to this post by Andres Valloud-4
On 28.05.14 03:11, Andres Valloud wrote:
> What is going to happen when one compares two general Unicode series of
> characters that represent the same string but differ in normalization?
> Wouldn't the size test would result in false negatives?

Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look
at it doesn't even look as if case insensitivity worked in Squeak/Pharo
outside of Latin-1 (I could be wrong though).

In addition you probably don't want #= to do normalization "because
performance". And even if you did you probably still want a fast path
for ByteString receiver and ByteString argument in which case #size is safe.

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Andres Valloud-4
Hey Philippe,

> Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
> fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look
> at it doesn't even look as if case insensitivity worked in Squeak/Pharo
> outside of Latin-1 (I could be wrong though).

Yes, that's what I am thinking about.  To be more explicit, suppose
"Unicode" series of characters got into the image via the keyboard, a
file, a socket... once decoded, what could one do with them?  Are all
types of decoded character series going to be represented as instances
of a single class, although they have inherently different behavior?

> In addition you probably don't want #= to do normalization "because
> performance". And even if you did you probably still want a fast path
> for ByteString receiver and ByteString argument in which case #size is safe.

Assuming all fixed width representation strings (e.g. byte strings) will
always have the same encoding (e.g. same code page), then the size check
for those seems ok to me.

Just to make sure, I am not celebrating all this complexity in the
world... however, given that it's there, how are we going to deal with
it?  I'm concerned about the long term consequences of making things
more complex than they are by reinterpreting them.  The problem I see is
that ultimately programs just won't Work(TM).

Andres.

Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Nicolas Cellier

2014-05-28 22:24 GMT+02:00 Andres Valloud <[hidden email]>:
Hey Philippe,


Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look
at it doesn't even look as if case insensitivity worked in Squeak/Pharo
outside of Latin-1 (I could be wrong though).

Yes, that's what I am thinking about.  To be more explicit, suppose "Unicode" series of characters got into the image via the keyboard, a file, a socket... once decoded, what could one do with them?  Are all types of decoded character series going to be represented as instances of a single class, although they have inherently different behavior?


I've got some changes pending to use Unicode for case insensitivity in Squeak, if it's urgent i can publish now...
But since Eliot is working on Character representation in Spur, i decided to wait a little bit.

Normalization is really a complex beast. There was an attempt by Yoshiki (CombinedChar) to correctly render the most simple composition sequences in now dead MultiCharacter* classes.
We decided to remove this incomplete support and postponed the rewrite for later...
(since we would need canonical representation in other places, it's better to separate it...)

The CombinedChar utility is still there but unused in rendering.
It is used in UnicodeCompositionStream which is used in ParagraphEditor>>readKeyboard (yes the st80 one!)

Pharo 3.0 has a UTF8DecomposedTextConverter, but that's far less general, is using CombinedChar and is unused in base.
 

In addition you probably don't want #= to do normalization "because
performance". And even if you did you probably still want a fast path
for ByteString receiver and ByteString argument in which case #size is safe.

Assuming all fixed width representation strings (e.g. byte strings) will always have the same encoding (e.g. same code page), then the size check for those seems ok to me.

Just to make sure, I am not celebrating all this complexity in the world... however, given that it's there, how are we going to deal with it?  I'm concerned about the long term consequences of making things more complex than they are by reinterpreting them.  The problem I see is that ultimately programs just won't Work(TM).

Andres.

 
Sure, Unicode is aggregating centuries of history from the whole word, how could it be simple...
Our support is really bare minimum.
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: String >> #=

Philippe Marschall-2
In reply to this post by Andres Valloud-4
On 28.05.14 22:24, Andres Valloud wrote:

> Hey Philippe,
>
>> Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
>> fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look
>> at it doesn't even look as if case insensitivity worked in Squeak/Pharo
>> outside of Latin-1 (I could be wrong though).
>
> Yes, that's what I am thinking about.  To be more explicit, suppose
> "Unicode" series of characters got into the image via the keyboard, a
> file, a socket... once decoded, what could one do with them?  Are all
> types of decoded character series going to be represented as instances
> of a single class, although they have inherently different behavior?

I don't understand the question.

>> In addition you probably don't want #= to do normalization "because
>> performance". And even if you did you probably still want a fast path
>> for ByteString receiver and ByteString argument in which case #size is
>> safe.
>
> Assuming all fixed width representation strings (e.g. byte strings) will
> always have the same encoding (e.g. same code page), then the size check
> for those seems ok to me.

All Strings are fixed width in Pharo/Squeak. If you have a single
non-Latin1 character (code point) in a String all characters in the
String are promoted from a byte to an OOP. #size then answers the number
of OOPs instead of the number of bytes. So #size always answers the
number of characters (code points) non-normalized (because there is no
way to do normalization in Pharo/Squeak).

> Just to make sure, I am not celebrating all this complexity in the
> world... however, given that it's there, how are we going to deal with
> it?  I'm concerned about the long term consequences of making things
> more complex than they are by reinterpreting them.  The problem I see is
> that ultimately programs just won't Work(TM).

This seems to me a more general discussion than the problem at hand.
Again at the moment Pharo/Squeak is largely unaware of Unicode. It
supports very large code points but without any semantics (similar to
let's say Erlang with doesn't even have a String type). What
Pharo/Squeak however does know about is encodings for mapping the
Strings to and from bytes.

And quite honestly doing normalization in #= may cause things to just
won't Work(TM). Consider for example an HTTP request with a query string
with two different query fields that are normalized equal. Would you
want the values to be stored under one dictionary key?

Cheers
Philippe