Hi Sven and Everyone,
Instantiation of character encoders is relatively expensive, as can be seen by the following code snippet: | count str ba enc new old | count := 10000. str := 'test-äöü'. new := [ count timesRepeat: [ (ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str ] ] timeToRunWithoutGC. enc := ZnCharacterEncoder newForEncoding: 'utf8'. old := [ count timesRepeat: [ enc encodeString: str ] ] timeToRunWithoutGC. { old. new. } " #(14 122)" What do you think of the idea of caching default instances of encoders? My idea was to have a dictionary in ZnCharacterEncoder and add a class method #forEncoding:. This would lazily create an instance, store it in the dictionary, and then re-use the instance whenever that encoder is requested. If people really want their own instance (not sure why, there aren't any instance variables, and so there's no state), they can use the existing #newForEncoding:. For example, sending #exists to a file reference currently instantiates two encoders. This was actually previously fixed by #20273[1], but seems to have been lost with the refactoring of FilePluginPrims to File. From memory, Cyril's application saved 12 seconds by caching the encoder. But in that case the instance was private to the file plugin primitives. This would be a more general solution. [1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods Thanks, Alistair |
On Tue, 28 Aug 2018 at 12:43, Alistair Grant <[hidden email]> wrote:
> > Hi Sven and Everyone, > > Instantiation of character encoders is relatively expensive, as can be > seen by the following code snippet: > > | count str ba enc new old | > > count := 10000. > str := 'test-äöü'. > new := [ > count timesRepeat: [ > (ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str > ] ] timeToRunWithoutGC. > enc := ZnCharacterEncoder newForEncoding: 'utf8'. > old := [ > count timesRepeat: [ > enc encodeString: str ] ] timeToRunWithoutGC. > { old. new. } > > " #(14 122)" > > What do you think of the idea of caching default instances of encoders? > > My idea was to have a dictionary in ZnCharacterEncoder and add a class > method #forEncoding:. This would lazily create an instance, store it > in the dictionary, and then re-use the instance whenever that encoder > is requested. If people really want their own instance (not sure why, > there aren't any instance variables, and so there's no state), they > can use the existing #newForEncoding:. > > For example, sending #exists to a file reference currently > instantiates two encoders. > > This was actually previously fixed by #20273[1], but seems to have > been lost with the refactoring of FilePluginPrims to File. From > memory, Cyril's application saved 12 seconds by caching the encoder. > But in that case the instance was private to the file plugin > primitives. This would be a more general solution. > > [1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods Just so people don't focus on the wrong thing: I realise that the ratio (almost 10:1 above) drops quickly as the string being converted grows in length. But if we take typical file names, the ratio, while smaller, is still significant, as suggested by Cyril's anecdote. Thanks, Alistair |
In reply to this post by alistairgrant
Hi Alistair,
Thanks for the feedback, but consider: count := 10000. str := 'test-äöü'. [ count timesRepeat: [ ZnCharacterEncoder utf8 encodeString: str ] ] timeToRunWithoutGC. => 8 #newForEncoding: is indeed slow as it has to look through all known encoders, which also involves a subclasses search, which is a PITA. ZnCharacterEncoder class>>#utf8 actually uses a cached default instance (see ZnUTF8Encoder class>>#default). As this is the most used encoder, with no customization options, it makes sense. For example, #utf8Encoded and #utf8Decoded use this instance. However, some (byte, utf16, utf32) encoders do have state (endianness, strictness, ...), so global caching could be an issue. Users of the encoders are free to cache their own instances. I am not (yet) convinced that a global cache is needed. Sven > On 28 Aug 2018, at 12:43, Alistair Grant <[hidden email]> wrote: > > Hi Sven and Everyone, > > Instantiation of character encoders is relatively expensive, as can be > seen by the following code snippet: > > | count str ba enc new old | > > count := 10000. > str := 'test-äöü'. > new := [ > count timesRepeat: [ > (ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str > ] ] timeToRunWithoutGC. > enc := ZnCharacterEncoder newForEncoding: 'utf8'. > old := [ > count timesRepeat: [ > enc encodeString: str ] ] timeToRunWithoutGC. > { old. new. } > > " #(14 122)" > > What do you think of the idea of caching default instances of encoders? > > My idea was to have a dictionary in ZnCharacterEncoder and add a class > method #forEncoding:. This would lazily create an instance, store it > in the dictionary, and then re-use the instance whenever that encoder > is requested. If people really want their own instance (not sure why, > there aren't any instance variables, and so there's no state), they > can use the existing #newForEncoding:. > > For example, sending #exists to a file reference currently > instantiates two encoders. > > This was actually previously fixed by #20273[1], but seems to have > been lost with the refactoring of FilePluginPrims to File. From > memory, Cyril's application saved 12 seconds by caching the encoder. > But in that case the instance was private to the file plugin > primitives. This would be a more general solution. > > [1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods > > Thanks, > Alistair > |
Hi Sven,
Thanks for the quick reply. On Tue, 28 Aug 2018 at 13:17, Sven Van Caekenberghe <[hidden email]> wrote: > > Hi Alistair, > > Thanks for the feedback, but consider: > > count := 10000. > str := 'test-äöü'. > [ > count timesRepeat: [ > ZnCharacterEncoder utf8 encodeString: str > ] ] timeToRunWithoutGC. > > => 8 > > #newForEncoding: is indeed slow as it has to look through all known encoders, which also involves a subclasses search, which is a PITA. > > ZnCharacterEncoder class>>#utf8 actually uses a cached default instance (see ZnUTF8Encoder class>>#default). As this is the most used encoder, with no customization options, it makes sense. For example, #utf8Encoded and #utf8Decoded use this instance. > > However, some (byte, utf16, utf32) encoders do have state (endianness, strictness, ...), so global caching could be an issue. > > Users of the encoders are free to cache their own instances. > > I am not (yet) convinced that a global cache is needed. Somehow I missed the existence of ZnCharacterEncoder class>>#utf8. That certainly covers the cases I'm think of, so I'll just use it. Thanks again, Alistair |
Free forum by Nabble | Edit this page |