Hi
I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes. Cheers Philippe |
Hello, for this use case people typically use Symbols (much faster). Now you are right we may add a shortcut in string comparison. 2014-05-26 9:51 GMT+02:00 Philippe Marschall <[hidden email]>: Hi |
On 26.05.14 11:08, Clément Bera wrote:
> Hello, > > for this use case people typically use Symbols (much faster). I don't see how that helps because #asSymbol does a look up into a dictionary with a String key as well. Cheers Philippe |
In reply to this post by Philippe Marschall-2
Hi Phillipe,
On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote: Hi Why not rewrite String>>= aString "Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)." aString isString ifFalse: [ ^ false ].
^ (self compare: self with: aString collated: AsciiOrder) = 2 as String>>= aString "Answer whether the receiver sorts equally as aString.
The collation order is simple ascii (with case differences)." (aString isString and: [self size = aString size]) ifFalse: [^false]. ^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2 ? -- best, Eliot
|
On 27.05.14 06:45, Eliot Miranda wrote:
> > > > > Hi Phillipe, > > > On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall > <[hidden email] > <mailto:[hidden email]>> wrote: > > Hi > > I have been investigating why Dictionary look up performance with > String keys is not as good as I would expected. Something I noted is > that String >> #= is implemented in terms of > #compare:with:collated:. There is no short circuit if Strings are > not the same size. In my case some Strings have the same prefix but > a different length eg 'Content-Type' and 'Content-Length'. In that > case a #compare:with:collated: is performed even though we know in > advance the answer will be false because they have different sizes. > > > Why not rewrite Yes, that would be my proposal as well. Cheers Philippe |
On 27 May 2014, at 09:44, Philippe Marschall <[hidden email]> wrote: > On 27.05.14 06:45, Eliot Miranda wrote: >> >> >> >> >> Hi Phillipe, >> >> >> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall >> <[hidden email] >> <mailto:[hidden email]>> wrote: >> >> Hi >> >> I have been investigating why Dictionary look up performance with >> String keys is not as good as I would expected. Something I noted is >> that String >> #= is implemented in terms of >> #compare:with:collated:. There is no short circuit if Strings are >> not the same size. In my case some Strings have the same prefix but >> a different length eg 'Content-Type' and 'Content-Length'. In that >> case a #compare:with:collated: is performed even though we know in >> advance the answer will be false because they have different sizes. >> >> >> Why not rewrite > > Yes, that would be my proposal as well. > Ok, I have added an issue: https://pharo.fogbugz.com/f/cases/13277/Speed-up-String Marcus |
In reply to this post by Philippe Marschall-2
On 26.05.14 09:51, Philippe Marschall wrote:
> Hi > > I have been investigating why Dictionary look up performance with String > keys is not as good as I would expected. Something I noted is that > String >> #= is implemented in terms of #compare:with:collated:. There > is no short circuit if Strings are not the same size. In my case some > Strings have the same prefix but a different length eg 'Content-Type' > and 'Content-Length'. In that case a #compare:with:collated: is > performed even though we know in advance the answer will be false > because they have different sizes. Some follow up on this: In Seaside we have a custom dictionary GRSmallDictionary which is more or less an OrderedCollection of Associations (not actually but conceptually). I did a microbenchmark with 11 keys [1] and out of the box Dictionary is about twice as fast as GRSmallDictionary. However when I change GRSmallDictionary to first send and compare #size before sending #= then GRSmallDictionary is slightly faster than Dictionary. [1] #('Content-Type' 'Content-Language' 'Content-Length' 'Date' 'Last-Modified' 'Location' 'Set-Cookie' 'Set-Cookie2' 'Servlet-Engine' 'Status' 'WWW-Authenticate') Cheers Philippe |
In reply to this post by Eliot Miranda-2
Quoting Eliot Miranda <[hidden email]>:
BTW, any good reason for not prefixing all the implementors of #= with this? Cheers,
Juan Vuletich |
On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote: --
This makes a huge difference, over 3 times faster: | bs t1 t2 | bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.
{ t1. t2 } #(13726 4467) 4467 - 13726 / 137.26 -67.46%
It doesn't make much difference: | bs t1 t2 | bs := ByteString allInstances first: 10000.
t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. { t1. t2 } #(4628 4560)
4560 - 4628 / 46.28 -1.47% So is it worth it? If you feel it is I've no objection other than it feels a little kludgey for such little benefit. And there are the Symbols if one needs quick comparison and can bear the cost of slow interning.
best, Eliot
|
What is going to happen when one compares two general Unicode series of
characters that represent the same string but differ in normalization? Wouldn't the size test would result in false negatives? http://unicode.org/reports/tr15/ I'm asking because I haven't seen any discussion on the subject, and the decision to change the code as proposed could have side effects. On 5/27/14 11:59 , Eliot Miranda wrote: > > > > On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) > <[hidden email] <mailto:[hidden email]>> wrote: > > __ > > Quoting Eliot Miranda <[hidden email] > <mailto:[hidden email]>>: > >> Hi Phillipe, >> >> >> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall >> <[hidden email] >> <mailto:[hidden email]>> wrote: >> >> Hi >> >> I have been investigating why Dictionary look up performance >> with String keys is not as good as I would expected. Something >> I noted is that String >> #= is implemented in terms of >> #compare:with:collated:. There is no short circuit if Strings >> are not the same size. In my case some Strings have the same >> prefix but a different length eg 'Content-Type' and >> 'Content-Length'. In that case a #compare:with:collated: is >> performed even though we know in advance the answer will be >> false because they have different sizes. >> >> Why not rewrite >> String>>= aString >> "Answer whether the receiver sorts equally as aString. >> The collation order is simple ascii (with case differences)." >> aString isString ifFalse: [ ^ false ]. >> ^ (self compare: self with: aString collated: AsciiOrder) = 2 >> as >> String>>= aString >> "Answer whether the receiver sorts equally as aString. >> The collation order is simple ascii (with case differences)." >> (aString isString >> and: [self size = aString size]) ifFalse: [^false]. >> ^ (self compare: self withSize: with: aString collated: >> AsciiOrder) = 2 >> ? > > > This makes a huge difference, over 3 times faster: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(13726 4467) > 4467 - 13726 / 137.26 -67.46% > >> One /could/ add a replacement compare:with:collated: >> primitive primitiveCompareString which took the sizes as arguments >> to avoid asking twice. But it wouldn't be safe. One could abuse >> the primitive and lie about the size. So I suspect it is best to >> add the size check to String>>#= and accept the duplication of >> the primitive finding the sizes of the two strings. The cost in >> the primitive is minimal. A WideString version of the primitive >> might pay its way, but if Spur and Sista arrive soon the primitive >> shouldn't be faster than the optimised Smalltalk code. >> -- >> best, >> Eliot > > BTW, any good reason for not prefixing all the implementors of #= > with this? > > "Any object is equal to itself" > self == argument ifTrue: [ ^ true ]. > > > It doesn't make much difference: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(4628 4560) > > 4560 - 4628 / 46.28 -1.47% > > So is it worth it? If you feel it is I've no objection other than it > feels a little kludgey for such little benefit. And there are the > Symbols if one needs quick comparison and can bear the cost of slow > interning. > -- > best, > Eliot |
In reply to this post by Eliot Miranda-2
> On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) <[hidden email]> wrote:
>> >> Quoting Eliot Miranda <[hidden email]>: >> >> Hi Phillipe, >> >> >> On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall <[hidden email]> wrote: >>> >>> Hi >>> >>> I have been investigating why Dictionary look up performance with String keys is not as good as I would expected. Something I noted is that String >> #= is implemented in terms of #compare:with:collated:. There is no short circuit if Strings are not the same size. In my case some Strings have the same prefix but a different length eg 'Content-Type' and 'Content-Length'. In that case a #compare:with:collated: is performed even though we know in advance the answer will be false because they have different sizes. >> >> >> Why not rewrite >> >> String>>= aString >> "Answer whether the receiver sorts equally as aString. >> The collation order is simple ascii (with case differences)." >> aString isString ifFalse: [ ^ false ]. >> ^ (self compare: self with: aString collated: AsciiOrder) = 2 >> >> as >> >> String>>= aString >> "Answer whether the receiver sorts equally as aString. >> The collation order is simple ascii (with case differences)." >> >> (aString isString >> and: [self size = aString size]) ifFalse: [^false]. >> ^ (self compare: self withSize: with: aString collated: AsciiOrder) = 2 >> >> ? > > > This makes a huge difference, over 3 times faster: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(13726 4467) > 4467 - 13726 / 137.26 -67.46% >> >> >> One /could/ add a replacement compare:with:collated: primitive primitiveCompareString which took the sizes as arguments to avoid asking twice. But it wouldn't be safe. One could abuse the primitive and lie about the size. So I suspect it is best to add the size check to String>>#= and accept the duplication of the primitive finding the sizes of the two strings. The cost in the primitive is minimal. A WideString version of the primitive might pay its way, but if Spur and Sista arrive soon the primitive shouldn't be faster than the optimised Smalltalk code. >> -- >> best, >> Eliot >> >> BTW, any good reason for not prefixing all the implementors of #= with this? >> >> "Any object is equal to itself" >> self == argument ifTrue: [ ^ true ]. > > > It doesn't make much difference: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: '/Users/eliot/Squeak/Squeak4.5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(4628 4560) > > 4560 - 4628 / 46.28 -1.47% > > So is it worth it? If you feel it is I've no objection other than it feels a little kludgey for such little benefit. That is a very common convention in the image, not kludgey. String>>#= should advertise, by its implementation, its desire and intent for maximum performance. Not covering this check conveys a lack of thoroghness and/or dedication to performance. Future readers of the method will wonder why such a simple avoidance of unnecessary processing for that case wasn't taken. By the time they wonder whether it was an oversight they've already expended too much thought about it. We should really consider Andres' question before popping this into trunk though.. > And there are the Symbols if one needs quick comparison and can bear the cost of slow interning. > -- > best, > Eliot > |
In reply to this post by Eliot Miranda-2
String encoding is perpendicular to my point. I'm referring to
canonical equivalence as defined in section 1.1 of the document referenced by the URL I sent. For instance, the Hangul example in the first table shows that a combination of two characters (regardless of encoding) is to be considered canonically equivalent to a single character. From the document (which claims to be Unicode Standard Annex #15), "Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior." How do you propose that a size check is appropriate in the presence of canonical equivalence? What is string equivalence supposed to mean? I think more attention should be given to those questions. On 5/27/14 18:53 , Eliot Miranda wrote: > Hi Andres, > > > On Tue, May 27, 2014 at 6:10 PM, Andres Valloud > <[hidden email] > <mailto:[hidden email]>> wrote: > > What is going to happen when one compares two general Unicode series > of characters that represent the same string but differ in > normalization? Wouldn't the size test would result in false negatives? > > http://unicode.org/reports/__tr15/ <http://unicode.org/reports/tr15/> > > I'm asking because I haven't seen any discussion on the subject, and > the decision to change the code as proposed could have side effects. > > > The issue is whether String supports variable-sized encodings such as > UTF-8 where there is no fixed relationship between the number of bytes i > a string and the number of characters in a string. Right now we have > ByteString and WideString. ByteString has 1 byte per character. > WideString has 4 bytes per character. So 'hello' asByteString > contains 5 bytes and has size 5, but 'hello' asWideString contains 20 > bytes and also has size 5. Hence the size check is fine, since size > answers the number of characters, not the number of bytes. If we were > to add a UTF8String we'd have to delete the size check. But I think for > now we're not going to do that. > > A ByteString can contain some characters that comprise a UTF-8 string > (see UTF8TextCoverter) but that's a convention of usage. if you print > some ByteString containing the UTF-8 encoding of a string containing > characters that take more than one byte to encode, that string won't > print as the input, it'll print treating each byte as a character, and > so will scramble the string. It is up to the user to handle these > ByteStrings that happen by convention to contain UTF-8 correctly. > > Note that there is nothing to stop us adding a UTF8String *provided* > that class implements size to answer the number of characters, not the > number of bytes. My understanding is that VW takes this approach also. > File streams expose the encoding, sicne position is a byte position, not > a character position, and so it is up to the file stream client to cope > with the positioning complexities that this introduces, not the stream. > > > OK? > > > > On 5/27/14 11:59 , Eliot Miranda wrote: > > > > > On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists) > <[hidden email] <mailto:[hidden email]> > <mailto:juanlists@jvuletich.__org > <mailto:[hidden email]>>> wrote: > > __ > > Quoting Eliot Miranda <[hidden email] > <mailto:[hidden email]> > <mailto:eliot.miranda@gmail.__com > <mailto:[hidden email]>>>: > > Hi Phillipe, > > > On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall > <philippe.marschall@netcetera.__ch > <mailto:[hidden email]> > <mailto:[hidden email] > <mailto:[hidden email]>>> wrote: > > Hi > > I have been investigating why Dictionary look up > performance > with String keys is not as good as I would > expected. Something > I noted is that String >> #= is implemented in terms of > #compare:with:collated:. There is no short circuit > if Strings > are not the same size. In my case some Strings have > the same > prefix but a different length eg 'Content-Type' and > 'Content-Length'. In that case a > #compare:with:collated: is > performed even though we know in advance the answer > will be > false because they have different sizes. > > Why not rewrite > String>>= aString > "Answer whether the receiver sorts equally as aString. > The collation order is simple ascii (with case > differences)." > aString isString ifFalse: [ ^ false ]. > ^ (self compare: self with: aString collated: > AsciiOrder) = 2 > as > String>>= aString > "Answer whether the receiver sorts equally as aString. > The collation order is simple ascii (with case > differences)." > (aString isString > and: [self size = aString size]) ifFalse: [^false]. > ^ (self compare: self withSize: with: aString collated: > AsciiOrder) = 2 > ? > > > > This makes a huge difference, over 3 times faster: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: > '/Users/eliot/Squeak/Squeak4.__5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(13726 4467) > 4467 - 13726 / 137.26 -67.46% > > One /could/ add a replacement compare:with:collated: > primitive primitiveCompareString which took the sizes > as arguments > to avoid asking twice. But it wouldn't be safe. One > could abuse > the primitive and lie about the size. So I suspect it > is best to > add the size check to String>>#= and accept the > duplication of > the primitive finding the sizes of the two strings. > The cost in > the primitive is minimal. A WideString version of the > primitive > might pay its way, but if Spur and Sista arrive soon > the primitive > shouldn't be faster than the optimised Smalltalk code. > -- > best, > Eliot > > > BTW, any good reason for not prefixing all the implementors > of #= > with this? > > "Any object is equal to itself" > self == argument ifTrue: [ ^ true ]. > > > It doesn't make much difference: > > | bs t1 t2 | > bs := ByteString allInstances first: 10000. > t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > (FileStream fileNamed: > '/Users/eliot/Squeak/Squeak4.__5/String-=.st') fileIn. > t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun. > { t1. t2 } #(4628 4560) > > 4560 - 4628 / 46.28 -1.47% > > So is it worth it? If you feel it is I've no objection other > than it > feels a little kludgey for such little benefit. And there are the > Symbols if one needs quick comparison and can bear the cost of slow > interning. > -- > best, > Eliot > > > > > > -- > best, > Eliot |
In reply to this post by Andres Valloud-4
On 28.05.14 03:11, Andres Valloud wrote:
> What is going to happen when one compares two general Unicode series of > characters that represent the same string but differ in normalization? > Wouldn't the size test would result in false negatives? Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look at it doesn't even look as if case insensitivity worked in Squeak/Pharo outside of Latin-1 (I could be wrong though). In addition you probably don't want #= to do normalization "because performance". And even if you did you probably still want a fast path for ByteString receiver and ByteString argument in which case #size is safe. Cheers Philippe |
Hey Philippe,
> Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In > fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look > at it doesn't even look as if case insensitivity worked in Squeak/Pharo > outside of Latin-1 (I could be wrong though). Yes, that's what I am thinking about. To be more explicit, suppose "Unicode" series of characters got into the image via the keyboard, a file, a socket... once decoded, what could one do with them? Are all types of decoded character series going to be represented as instances of a single class, although they have inherently different behavior? > In addition you probably don't want #= to do normalization "because > performance". And even if you did you probably still want a fast path > for ByteString receiver and ByteString argument in which case #size is safe. Assuming all fixed width representation strings (e.g. byte strings) will always have the same encoding (e.g. same code page), then the size check for those seems ok to me. Just to make sure, I am not celebrating all this complexity in the world... however, given that it's there, how are we going to deal with it? I'm concerned about the long term consequences of making things more complex than they are by reinterpreting them. The problem I see is that ultimately programs just won't Work(TM). Andres. |
2014-05-28 22:24 GMT+02:00 Andres Valloud <[hidden email]>: Sure, Unicode is aggregating centuries of history from the whole word, how could it be simple...Hey Philippe, I've got some changes pending to use Unicode for case insensitivity in Squeak, if it's urgent i can publish now... But since Eliot is working on Character representation in Spur, i decided to wait a little bit. Normalization is really a complex beast. There was an attempt by Yoshiki (CombinedChar) to correctly render the most simple composition sequences in now dead MultiCharacter* classes. We decided to remove this incomplete support and postponed the rewrite for later... (since we would need canonical representation in other places, it's better to separate it...) The CombinedChar utility is still there but unused in rendering. It is used in UnicodeCompositionStream which is used in ParagraphEditor>>readKeyboard (yes the st80 one!) Pharo 3.0 has a UTF8DecomposedTextConverter, but that's far less general, is using CombinedChar and is unused in base.
Our support is really bare minimum.
|
In reply to this post by Andres Valloud-4
On 28.05.14 22:24, Andres Valloud wrote:
> Hey Philippe, > >> Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In >> fact AFAIK Squeak/Pharo is unaware of normalization. Having a short look >> at it doesn't even look as if case insensitivity worked in Squeak/Pharo >> outside of Latin-1 (I could be wrong though). > > Yes, that's what I am thinking about. To be more explicit, suppose > "Unicode" series of characters got into the image via the keyboard, a > file, a socket... once decoded, what could one do with them? Are all > types of decoded character series going to be represented as instances > of a single class, although they have inherently different behavior? I don't understand the question. >> In addition you probably don't want #= to do normalization "because >> performance". And even if you did you probably still want a fast path >> for ByteString receiver and ByteString argument in which case #size is >> safe. > > Assuming all fixed width representation strings (e.g. byte strings) will > always have the same encoding (e.g. same code page), then the size check > for those seems ok to me. All Strings are fixed width in Pharo/Squeak. If you have a single non-Latin1 character (code point) in a String all characters in the String are promoted from a byte to an OOP. #size then answers the number of OOPs instead of the number of bytes. So #size always answers the number of characters (code points) non-normalized (because there is no way to do normalization in Pharo/Squeak). > Just to make sure, I am not celebrating all this complexity in the > world... however, given that it's there, how are we going to deal with > it? I'm concerned about the long term consequences of making things > more complex than they are by reinterpreting them. The problem I see is > that ultimately programs just won't Work(TM). This seems to me a more general discussion than the problem at hand. Again at the moment Pharo/Squeak is largely unaware of Unicode. It supports very large code points but without any semantics (similar to let's say Erlang with doesn't even have a String type). What Pharo/Squeak however does know about is encodings for mapping the Strings to and from bytes. And quite honestly doing normalization in #= may cause things to just won't Work(TM). Consider for example an HTTP request with a query string with two different query fields that are normalized equal. Would you want the values to be stored under one dictionary key? Cheers Philippe |
Free forum by Nabble | Edit this page |