String deduplication

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

String deduplication

Philippe Marschall-2
Hi

This is an idea I stole from somebody else. The assumption is that you
have a lot of Strings in the image that are equal. We could therefore
remove the duplicates and make all the objects refer to the same instance.

However it's not a simple as that. The main issue is that String has two
responsibilities. The first is as an immutable value object. The second
is as a mutable character buffer for building immutable value objects.
We must not deduplicate the second kind. Unfortunately it's not straight
forward to figure out which kind a string is. The approach I took is
looking at whether it contains any 0 characters. An other option would
be to check whether any WirteStreams are referring to it.
Also, since there are behavioral differences between String and Symbol
besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in
the heap and they compare #= true but we must not make anybody who
refers to 'hello' suddenly refer to #'hello').

Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3
image. Sorry for the bad variable names.

| b d m |
b := Bag new.
d := OrderedCollection new.
m := Dictionary new.
"count all string instances"
String allSubInstancesDo: [ :s |
     s isSymbol ifFalse: [
         b add: s ] ].
"find the ones that have no duplicates or are likely buffers"
b doWithOccurrences: [ :s :i |
     (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
         d add: s -> i ] ].
"remove the ones that have no duplicates or are likely buffers"
d do: [ :a |
     a value timesRepeat: [
         b remove: a key ]  ].
"map all duplicate strings to their duplicates"
String allSubInstancesDo: [ :s |
     s isSymbol ifFalse: [
         (b includes: s) ifTrue: [
             | l |
             l := m at: s ifAbsentPut: [ OrderedCollection new ].
             l add: s  ] ].
"remove the duplicates"
m keysAndValues do [ :k :v |
     | f |
     f := v at: 1.
     2 to: v size do: [ :i |
         (v at: i) becomeForward: f ] ]

Cheers
Philippe


Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Marcus Denker-4

On 30 May 2014, at 09:39, Philippe Marschall <[hidden email]> wrote:

> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.
>
> However it's not a simple as that. The main issue is that String has two responsibilities. The first is as an immutable value object. The second is as a mutable character buffer for building immutable value objects. We must not deduplicate the second kind. Unfortunately it's not straight forward to figure out which kind a string is. The approach I took is looking at whether it contains any 0 characters. An other option would be to check whether any WirteStreams are referring to it.

One idea could be to have an explicit immutable string literal class (or to set the immutability for literals in the compiler when we have support for it for literals).
These then would be save to de-duplicate, would be interesting to now how large the percentage is of literals among all your de-douplicated strings.

When playing with first class references (that can in addition override behavior) we had the idea that one could use that for realising a general “copy on write” for any kind of object and even
whole object graphs… the idea would be to  search for object graphs that are the same and replace the all pointers with just pointer-objects that trap all writes to one singe
copy.

In a second step one could combine that with “crystallising” unused subgraphs (that is, serialise and compress them in memory, with references to the graph on-demand and transparently
decompress). Combined, this would save a lot of space.

But the devil is in the detail: how to find unused and equal subgraphs efficiently.

> Also, since there are behavioral differences between String and Symbol besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the heap and they compare #= true but we must not make anybody who refers to 'hello' suddenly refer to #'hello').
>
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 image. Sorry for the bad variable names.
>
Nice!

        Marcus
Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Clément Béra
In reply to this post by Philippe Marschall-2
Hello,

I like the idea but this is not as simple.

In some framework you may use different string with a same name as markers that are not equals.

Typically:

Object>>#string1
    ^ 'string'

Object>>#string2
    ^ 'string'

Object>>#test
    self assert: self string1 == self string1. "Answers true"
    self assert: self string2 == self string2. "Answers true"
    self assert: self string1 == self string2 "Answers false"

Frameworks relying on that will not work any more.

And this kind of bugs is not easy to spot, it typically crashes identity collections in a non deterministic fashion.

Regards


2014-05-30 9:39 GMT+02:00 Philippe Marschall <[hidden email]>:
Hi

This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.

However it's not a simple as that. The main issue is that String has two responsibilities. The first is as an immutable value object. The second is as a mutable character buffer for building immutable value objects. We must not deduplicate the second kind. Unfortunately it's not straight forward to figure out which kind a string is. The approach I took is looking at whether it contains any 0 characters. An other option would be to check whether any WirteStreams are referring to it.
Also, since there are behavioral differences between String and Symbol besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the heap and they compare #= true but we must not make anybody who refers to 'hello' suddenly refer to #'hello').

Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 image. Sorry for the bad variable names.

| b d m |
b := Bag new.
d := OrderedCollection new.
m := Dictionary new.
"count all string instances"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        b add: s ] ].
"find the ones that have no duplicates or are likely buffers"
b doWithOccurrences: [ :s :i |
    (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
        d add: s -> i ] ].
"remove the ones that have no duplicates or are likely buffers"
d do: [ :a |
    a value timesRepeat: [
        b remove: a key ]  ].
"map all duplicate strings to their duplicates"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        (b includes: s) ifTrue: [
            | l |
            l := m at: s ifAbsentPut: [ OrderedCollection new ].
            l add: s  ] ].
"remove the duplicates"
m keysAndValues do [ :k :v |
    | f |
    f := v at: 1.
    2 to: v size do: [ :i |
        (v at: i) becomeForward: f ] ]

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Sven Van Caekenberghe-2
Hmm, code that depends on 2 #= Strings to be not #== ?
That would either be a very special case but more likely a bug.
In any case, it won't be very common I guess.

The mutability/immutability is a way more important issue, unfixable unless we introduce a new class IMHO.

But saving 2MB is impressive.

On 30 May 2014, at 10:59, Clément Bera <[hidden email]> wrote:

> Hello,
>
> I like the idea but this is not as simple.
>
> In some framework you may use different string with a same name as markers that are not equals.
>
> Typically:
>
> Object>>#string1
>     ^ 'string'
>
> Object>>#string2
>     ^ 'string'
>
> Object>>#test
>     self assert: self string1 == self string1. "Answers true"
>     self assert: self string2 == self string2. "Answers true"
>     self assert: self string1 == self string2 "Answers false"
>
> Frameworks relying on that will not work any more.
>
> And this kind of bugs is not easy to spot, it typically crashes identity collections in a non deterministic fashion.
>
> Regards
>
>
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall <[hidden email]>:
> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.
>
> However it's not a simple as that. The main issue is that String has two responsibilities. The first is as an immutable value object. The second is as a mutable character buffer for building immutable value objects. We must not deduplicate the second kind. Unfortunately it's not straight forward to figure out which kind a string is. The approach I took is looking at whether it contains any 0 characters. An other option would be to check whether any WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the heap and they compare #= true but we must not make anybody who refers to 'hello' suddenly refer to #'hello').
>
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
>     (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
>         d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
>     a value timesRepeat: [
>         b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         (b includes: s) ifTrue: [
>             | l |
>             l := m at: s ifAbsentPut: [ OrderedCollection new ].
>             l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
>     | f |
>     f := v at: 1.
>     2 to: v size do: [ :i |
>         (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Marcus Denker-4
In reply to this post by Clément Béra

On 30 May 2014, at 10:59, Clément Bera <[hidden email]> wrote:

Hello,

I like the idea but this is not as simple.

In some framework you may use different string with a same name as markers that are not equals.

Typically:

Object>>#string1
    ^ 'string'

Object>>#string2
    ^ 'string'

Object>>#test
    self assert: self string1 == self string1. "Answers true"
    self assert: self string2 == self string2. "Answers true"
    self assert: self string1 == self string2 "Answers false"

Frameworks relying on that will not work any more.

And this kind of bugs is not easy to spot, it typically crashes identity collections in a non deterministic fashion.


With an indirection (a kind of reference) that

-> points to the string
-> forwards everything, but does a copy on write on state change
-> implements == to return false


it would work. Of course you have then the same amount of objects(+1), but they would be all very 
small, thus leading to saving for large objects and especially when applied to subgraphs.

Marcus


Regards


2014-05-30 9:39 GMT+02:00 Philippe Marschall <[hidden email]>:
Hi

This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.

However it's not a simple as that. The main issue is that String has two responsibilities. The first is as an immutable value object. The second is as a mutable character buffer for building immutable value objects. We must not deduplicate the second kind. Unfortunately it's not straight forward to figure out which kind a string is. The approach I took is looking at whether it contains any 0 characters. An other option would be to check whether any WirteStreams are referring to it.
Also, since there are behavioral differences between String and Symbol besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the heap and they compare #= true but we must not make anybody who refers to 'hello' suddenly refer to #'hello').

Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 image. Sorry for the bad variable names.

| b d m |
b := Bag new.
d := OrderedCollection new.
m := Dictionary new.
"count all string instances"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        b add: s ] ].
"find the ones that have no duplicates or are likely buffers"
b doWithOccurrences: [ :s :i |
    (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
        d add: s -> i ] ].
"remove the ones that have no duplicates or are likely buffers"
d do: [ :a |
    a value timesRepeat: [
        b remove: a key ]  ].
"map all duplicate strings to their duplicates"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        (b includes: s) ifTrue: [
            | l |
            l := m at: s ifAbsentPut: [ OrderedCollection new ].
            l add: s  ] ].
"remove the duplicates"
m keysAndValues do [ :k :v |
    | f |
    f := v at: 1.
    2 to: v size do: [ :i |
        (v at: i) becomeForward: f ] ]

Cheers
Philippe




Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Nicolas Cellier
I'm curious to see which method is really using mutable strings.
Of course, while constructing and write streaming on it, but then it's like a temporary storage area and we don't have to care.
We're speaking of a place where a String would be modified to retain some state...
Of course, since we can't exclude this possibility theoretically, the proposed changed is unsafe.
But practically...


2014-05-30 13:46 GMT+02:00 Marcus Denker <[hidden email]>:

On 30 May 2014, at 10:59, Clément Bera <[hidden email]> wrote:

Hello,

I like the idea but this is not as simple.

In some framework you may use different string with a same name as markers that are not equals.

Typically:

Object>>#string1
    ^ 'string'

Object>>#string2
    ^ 'string'

Object>>#test
    self assert: self string1 == self string1. "Answers true"
    self assert: self string2 == self string2. "Answers true"
    self assert: self string1 == self string2 "Answers false"

Frameworks relying on that will not work any more.

And this kind of bugs is not easy to spot, it typically crashes identity collections in a non deterministic fashion.


With an indirection (a kind of reference) that

-> points to the string
-> forwards everything, but does a copy on write on state change
-> implements == to return false


it would work. Of course you have then the same amount of objects(+1), but they would be all very 
small, thus leading to saving for large objects and especially when applied to subgraphs.

Marcus


Regards


2014-05-30 9:39 GMT+02:00 Philippe Marschall <[hidden email]>:
Hi

This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.

However it's not a simple as that. The main issue is that String has two responsibilities. The first is as an immutable value object. The second is as a mutable character buffer for building immutable value objects. We must not deduplicate the second kind. Unfortunately it's not straight forward to figure out which kind a string is. The approach I took is looking at whether it contains any 0 characters. An other option would be to check whether any WirteStreams are referring to it.
Also, since there are behavioral differences between String and Symbol besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the heap and they compare #= true but we must not make anybody who refers to 'hello' suddenly refer to #'hello').

Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 image. Sorry for the bad variable names.

| b d m |
b := Bag new.
d := OrderedCollection new.
m := Dictionary new.
"count all string instances"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        b add: s ] ].
"find the ones that have no duplicates or are likely buffers"
b doWithOccurrences: [ :s :i |
    (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
        d add: s -> i ] ].
"remove the ones that have no duplicates or are likely buffers"
d do: [ :a |
    a value timesRepeat: [
        b remove: a key ]  ].
"map all duplicate strings to their duplicates"
String allSubInstancesDo: [ :s |
    s isSymbol ifFalse: [
        (b includes: s) ifTrue: [
            | l |
            l := m at: s ifAbsentPut: [ OrderedCollection new ].
            l add: s  ] ].
"remove the duplicates"
m keysAndValues do [ :k :v |
    | f |
    f := v at: 1.
    2 to: v size do: [ :i |
        (v at: i) becomeForward: f ] ]

Cheers
Philippe





Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Chris Muller-3
In reply to this post by Philippe Marschall-2
I hope you're only considering this one-time for your Pharo release
image, and not something that will _continue_ to operating on an
on-going basis.  Attempting to do that below app-layer would be way
too intrusive for, for example, a Magma application..    A String read
from one DB session belongs to that session.  If it were canonicalized
and suddenly belonged to two sessions, then mutating it would end up
committing that change in both sessions.  Bad.

Strings aren't the only type to share.  What about Dates?  And
literal-Array's?  Timezones and Durations?  This is an
app-responsibility, not a system one.  A one-time compress prior to
release is okay, but not automatically behind the scenes..


On Fri, May 30, 2014 at 2:39 AM, Philippe Marschall
<[hidden email]> wrote:

> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have
> a lot of Strings in the image that are equal. We could therefore remove the
> duplicates and make all the objects refer to the same instance.
>
> However it's not a simple as that. The main issue is that String has two
> responsibilities. The first is as an immutable value object. The second is
> as a mutable character buffer for building immutable value objects. We must
> not deduplicate the second kind. Unfortunately it's not straight forward to
> figure out which kind a string is. The approach I took is looking at whether
> it contains any 0 characters. An other option would be to check whether any
> WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the
> heap and they compare #= true but we must not make anybody who refers to
> 'hello' suddenly refer to #'hello').
>
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3
> image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
>     (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
>         d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
>     a value timesRepeat: [
>         b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         (b includes: s) ifTrue: [
>             | l |
>             l := m at: s ifAbsentPut: [ OrderedCollection new ].
>             l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
>     | f |
>     f := v at: 1.
>     2 to: v size do: [ :i |
>         (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
>
>

Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Philippe Marschall-2
On 30.05.14 17:12, Chris Muller wrote:
> I hope you're only considering this one-time for your Pharo release
> image, and not something that will _continue_ to operating on an
> on-going basis.

Of course. Besides all the problems it's also way too slow.

Cheers
Philippe



Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

demarey
In reply to this post by Philippe Marschall-2

Le 30 mai 2014 à 09:39, Philippe Marschall a écrit :

> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.


I worked on a String optimisation for a Java virtual machine dedicated to small hardwares.
A string was represented by an array of bytes (UTF8 or ASCII encoding), a start position, and the number of characters of the string.
It allows to reuse the internal byte array for different strings but the String object was different for each String.
With this approach, you are able to save a lot of memory (but with some CPU overhead) and you don't have a problems because you have different String objects for each String.

ex: b1: 'Hello World! It's a sunny day'

'Hello World! It's a sunny day' : start = 0, count = 29, value b1
'Hello' : start = 0, count = 5, value b1
'Hello World!' : start = 0, count = 12, value b1
'World': start = 6, count = 5, value b1


I don't see how it may be applied to Smalltalk and it makes sense.

Christophe.


smime.p7s (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: String deduplication

Chris Muller-3
This is a normal pattern in application-development.  For example,
Todd Blanchard's HTML validating parser grabs a huge chunk of HTML,
and all the parts are simply referenced by position within that big
String.

On Mon, Jun 2, 2014 at 8:26 AM, Christophe Demarey
<[hidden email]> wrote:

>
> Le 30 mai 2014 à 09:39, Philippe Marschall a écrit :
>
>> Hi
>>
>> This is an idea I stole from somebody else. The assumption is that you have a lot of Strings in the image that are equal. We could therefore remove the duplicates and make all the objects refer to the same instance.
>
>
> I worked on a String optimisation for a Java virtual machine dedicated to small hardwares.
> A string was represented by an array of bytes (UTF8 or ASCII encoding), a start position, and the number of characters of the string.
> It allows to reuse the internal byte array for different strings but the String object was different for each String.
> With this approach, you are able to save a lot of memory (but with some CPU overhead) and you don't have a problems because you have different String objects for each String.
>
> ex: b1: 'Hello World! It's a sunny day'
>
> 'Hello World! It's a sunny day' : start = 0, count = 29, value b1
> 'Hello' : start = 0, count = 5, value b1
> 'Hello World!' : start = 0, count = 12, value b1
> 'World': start = 6, count = 5, value b1
>
>
> I don't see how it may be applied to Smalltalk and it makes sense.
>
> Christophe.
>