Smalltalk › Pharo › Pharo Smalltalk Developers

ZnCharacterEncoder default instances

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

4 messages Options

alistairgrant

ZnCharacterEncoder default instances

Hi Sven and Everyone,

Instantiation of character encoders is relatively expensive, as can be
seen by the following code snippet:

| count str ba enc new old |

count := 10000.
str := 'test-äöü'.
new := [
count timesRepeat: [
(ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str
] ] timeToRunWithoutGC.
enc := ZnCharacterEncoder newForEncoding: 'utf8'.
old := [
count timesRepeat: [
enc encodeString: str ] ] timeToRunWithoutGC.
{ old. new. }

" #(14 122)"

What do you think of the idea of caching default instances of encoders?

My idea was to have a dictionary in ZnCharacterEncoder and add a class
method #forEncoding:. This would lazily create an instance, store it
in the dictionary, and then re-use the instance whenever that encoder
is requested. If people really want their own instance (not sure why,
there aren't any instance variables, and so there's no state), they
can use the existing #newForEncoding:.

For example, sending #exists to a file reference currently
instantiates two encoders.

This was actually previously fixed by #20273[1], but seems to have
been lost with the refactoring of FilePluginPrims to File. From
memory, Cyril's application saved 12 seconds by caching the encoder.
But in that case the instance was private to the file plugin
primitives. This would be a more general solution.

[1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods

Thanks,
Alistair

alistairgrant

Re: ZnCharacterEncoder default instances

On Tue, 28 Aug 2018 at 12:43, Alistair Grant <[hidden email]> wrote:

>
> Hi Sven and Everyone,
>
> Instantiation of character encoders is relatively expensive, as can be
> seen by the following code snippet:
>
> | count str ba enc new old |
>
> count := 10000.
> str := 'test-äöü'.
> new := [
> count timesRepeat: [
> (ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str
> ] ] timeToRunWithoutGC.
> enc := ZnCharacterEncoder newForEncoding: 'utf8'.
> old := [
> count timesRepeat: [
> enc encodeString: str ] ] timeToRunWithoutGC.
> { old. new. }
>
> " #(14 122)"
>
> What do you think of the idea of caching default instances of encoders?
>
> My idea was to have a dictionary in ZnCharacterEncoder and add a class
> method #forEncoding:. This would lazily create an instance, store it
> in the dictionary, and then re-use the instance whenever that encoder
> is requested. If people really want their own instance (not sure why,
> there aren't any instance variables, and so there's no state), they
> can use the existing #newForEncoding:.
>
> For example, sending #exists to a file reference currently
> instantiates two encoders.
>
> This was actually previously fixed by #20273[1], but seems to have
> been lost with the refactoring of FilePluginPrims to File. From
> memory, Cyril's application saved 12 seconds by caching the encoder.
> But in that case the instance was private to the file plugin
> primitives. This would be a more general solution.
>
> [1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods

Just so people don't focus on the wrong thing:

I realise that the ratio (almost 10:1 above) drops quickly as the
string being converted grows in length. But if we take typical file
names, the ratio, while smaller, is still significant, as suggested by
Cyril's anecdote.

Thanks,
Alistair

Sven Van Caekenberghe-2

Re: ZnCharacterEncoder default instances

In reply to this post by alistairgrant

Hi Alistair,

Thanks for the feedback, but consider:

count := 10000.
str := 'test-äöü'.
[
count timesRepeat: [
ZnCharacterEncoder utf8 encodeString: str
] ] timeToRunWithoutGC.

=> 8

#newForEncoding: is indeed slow as it has to look through all known encoders, which also involves a subclasses search, which is a PITA.

ZnCharacterEncoder class>>#utf8 actually uses a cached default instance (see ZnUTF8Encoder class>>#default). As this is the most used encoder, with no customization options, it makes sense. For example, #utf8Encoded and #utf8Decoded use this instance.

However, some (byte, utf16, utf32) encoders do have state (endianness, strictness, ...), so global caching could be an issue.

Users of the encoders are free to cache their own instances.

I am not (yet) convinced that a global cache is needed.

Sven

> On 28 Aug 2018, at 12:43, Alistair Grant <[hidden email]> wrote:
>
> Hi Sven and Everyone,
>
> Instantiation of character encoders is relatively expensive, as can be
> seen by the following code snippet:
>
> | count str ba enc new old |
>
> count := 10000.
> str := 'test-äöü'.
> new := [
> count timesRepeat: [
> (ZnCharacterEncoder newForEncoding: 'utf8') encodeString: str
> ] ] timeToRunWithoutGC.
> enc := ZnCharacterEncoder newForEncoding: 'utf8'.
> old := [
> count timesRepeat: [
> enc encodeString: str ] ] timeToRunWithoutGC.
> { old. new. }
>
> " #(14 122)"
>
> What do you think of the idea of caching default instances of encoders?
>
> My idea was to have a dictionary in ZnCharacterEncoder and add a class
> method #forEncoding:. This would lazily create an instance, store it
> in the dictionary, and then re-use the instance whenever that encoder
> is requested. If people really want their own instance (not sure why,
> there aren't any instance variables, and so there's no state), they
> can use the existing #newForEncoding:.
>
> For example, sending #exists to a file reference currently
> instantiates two encoders.
>
> This was actually previously fixed by #20273[1], but seems to have
> been lost with the refactoring of FilePluginPrims to File. From
> memory, Cyril's application saved 12 seconds by caching the encoder.
> But in that case the instance was private to the file plugin
> primitives. This would be a more general solution.
>
> [1] https://pharo.fogbugz.com/f/cases/20273/FilePluginPrims-could-save-an-encoder-to-speedup-some-methods
>
> Thanks,
> Alistair
>

alistairgrant

Re: ZnCharacterEncoder default instances

Hi Sven,

Thanks for the quick reply.

On Tue, 28 Aug 2018 at 13:17, Sven Van Caekenberghe <[hidden email]> wrote:

>
> Hi Alistair,
>
> Thanks for the feedback, but consider:
>
> count := 10000.
> str := 'test-äöü'.
> [
> count timesRepeat: [
> ZnCharacterEncoder utf8 encodeString: str
> ] ] timeToRunWithoutGC.
>
> => 8
>
> #newForEncoding: is indeed slow as it has to look through all known encoders, which also involves a subclasses search, which is a PITA.
>
> ZnCharacterEncoder class>>#utf8 actually uses a cached default instance (see ZnUTF8Encoder class>>#default). As this is the most used encoder, with no customization options, it makes sense. For example, #utf8Encoded and #utf8Decoded use this instance.
>
> However, some (byte, utf16, utf32) encoders do have state (endianness, strictness, ...), so global caching could be an issue.
>
> Users of the encoders are free to cache their own instances.
>
> I am not (yet) convinced that a global cache is needed.

Somehow I missed the existence of ZnCharacterEncoder class>>#utf8.
That certainly covers the cases I'm think of, so I'll just use it.

Thanks again,
Alistair