Re: [Pharo-dev] String >> #=

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] String >> #=

Eliot Miranda-2
 
Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:
Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

Why not rewrite

String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2

as

String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."

(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2

?

One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: [Pharo-dev] String >> #=

J. Vuletich (mail lists)
 

Quoting Eliot Miranda <[hidden email]>:

Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:

Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

 
Why not rewrite
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2
 
as
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
 
(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
 
?
 
One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot

BTW, any good reason for not prefixing all the implementors of #= with this?

"Any object is equal to itself"
self == argument ifTrue: [ ^ true ].

Cheers,
Juan Vuletich
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: [Pharo-dev] String >> #=

Eliot Miranda-2
 



On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote:

Quoting Eliot Miranda <[hidden email]>:

Hi Phillipe,


On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:

Hi

I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.

 
Why not rewrite
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2
 
as
 
String>>= aString 
"Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)."
 
(aString isString
and: [self size = aString size]) ifFalse: [^false].
^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
 
?

This makes a huge difference, over 3 times faster:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(13726 4467)
4467 - 13726 / 137.26 -67.46% 
 
One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
--
best,
Eliot

BTW, any good reason for not prefixing all the implementors of #= with this?

"Any object is equal to itself"
self == argument ifTrue: [ ^ true ].


It doesn't make much difference:

| bs t1 t2 |
bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
(FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(4628 4560)

4560 - 4628 / 46.28 -1.47%

So is it worth it?  If you feel it is I've no objection other than it feels a little kludgey for such little benefit.  And there are the Symbols if one needs quick comparison and can bear the cost of slow interning.
--
best,
Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Re: [Pharo-dev] String >> #=

Chris Muller-3

> On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote:
>>
>> Quoting Eliot Miranda <[hidden email]>:
>>
>> Hi Phillipe,
>>
>>
>> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote:
>>>
>>> Hi
>>>
>>> I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes.
>>
>>
>> Why not rewrite
>>
>> String>>= aString
>> "Answer whether the receiver sorts equally as aString.
>> The collation order is simple ascii (with case differences)."
>> aString isString ifFalse: [ ^ false ].
>> ^ (self compare: self with: aString collated: AsciiOrder) = 2
>>
>> as
>>
>> String>>= aString
>> "Answer whether the receiver sorts equally as aString.
>> The collation order is simple ascii (with case differences)."
>>
>> (aString isString
>> and: [self size = aString size]) ifFalse: [^false].
>> ^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2
>>
>> ?
>
>
> This makes a huge difference, over 3 times faster:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(13726 4467)
> 4467 - 13726 / 137.26 -67.46%
>>
>>
>> One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice.  But it wouldn't be safe.  One could abuse the primitive and lie about the size.  So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings.  The cost in the primitive is minimal.  A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code.
>> --
>> best,
>> Eliot
>>
>> BTW, any good reason for not prefixing all the implementors of #= with this?
>>
>> "Any object is equal to itself"
>> self == argument ifTrue: [ ^ true ].
>
>
> It doesn't make much difference:
>
> | bs t1 t2 |
> bs := ByteString allInstances first: 10000.
> t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn.
> t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
> { t1. t2 } #(4628 4560)
>
> 4560 - 4628 / 46.28 -1.47%
>
> So is it worth it?  If you feel it is I've no objection other than it feels a little kludgey for such little benefit.

That is a very common convention in the image, not kludgey.
String>>#= should advertise, by its implementation, its desire and
intent for maximum performance.  Not covering this check conveys a
lack of thoroghness and/or dedication to performance.  Future readers
of the method will wonder why such a simple avoidance of unnecessary
processing for that case wasn't taken.  By the time they wonder
whether it was an oversight they've already expended too much thought
about it.

We should really consider Andres' question before popping this into
trunk though..

> And there are the Symbols if one needs quick comparison and can bear the cost of slow interning.



> --
> best,
> Eliot
>