InterpreterSimulator


Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Levente Uzonyi
 
Hi Clément,

On Tue, 15 Mar 2016, Clément Bera wrote:

(no quote again, sorry)

So, the idea is to add a counter to each cache entry, and increase it on
each hit. Once a counter reaches a given limit, divide all counters by
two[1], and optionally do housekeeping stuff.
By housekeeping I mean removing entries with zero counter value, resorting
the entries based on the counter values, and optionally reverting to a
single method when there's only one cache entry left.
By optionally I mean that if you don't want to do the housekeeping every
time the limit is reached, then another counter can be introduced, so that
housekeeping occurs only every nth time.
Since there are at most six cache entries, sorting can be done with
insertion sort or a hardcoded optimal sorting network. The former would be
simple and efficient; the latter could be even more efficient, but more
complex as well.
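
Concretely, the bookkeeping amounts to something like this image-side
sketch (the limit, the housekeeping period and the Association-based
entries are just illustrative; a real implementation would live in the
JIT's machine code, and reverting to a single method is omitted):

| entries limit period divisions recordHit |
entries := OrderedCollection new.  "Associations: receiver class -> hit counter"
limit := 256.                      "assumed division limit"
period := 4.                       "assumed: housekeeping on every 4th division"
divisions := 0.
recordHit := [:aClass | | entry |
    entry := entries
        detect: [:e | e key == aClass]
        ifNone: [entries add: aClass -> 0].
    entry value: entry value + 1.
    entry value >= limit ifTrue:
        ["divide all counters by two"
         entries do: [:e | e value: e value // 2].
         (divisions := divisions + 1) \\ period = 0 ifTrue:
            ["housekeeping: drop dead entries, resort by counter"
             entries := entries reject: [:e | e value = 0].
             entries sort: [:a :b | a value >= b value]]]].
recordHit value: Float.  "record a hit for the entry with receiver class Float"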

In practice this can be done (at least) two ways:

1. If there's a free byte at each entry, then the cache entries can store
their own counters. If we use single-byte counters, then it's practical to
use 256 as the limit, so that division happens exactly when the counter
overflows. Since the largest counter right after a division is 128, at
least 128 cache hits must pass between two divisions; and if we eventually
revert to a single method when there's only one entry left in the PIC,
then the average number will be slightly higher than 128.

Pros:
- simple bookkeeping
- sorting can be done in-place
Cons:
- there's probably no free byte
- if we don't want to do the housekeeping too often, then the PIC
will need a few extra bytes to store another counter for that

2. Add a single 64-bit word to the PIC. The lower 6 bytes can be used for
the 6 counters, one per cache entry. The remaining two bytes can hold the
housekeeping counter.

Pros:
- still fairly simple bookkeeping, though an overflow would have
side-effects on the next byte. This can be worked around by leaving an
empty bit between the counter bytes, at the cost of reducing the size of
the housekeeping counter. Another option is to check the counter value
before incrementing it, avoiding overflows altogether.
- dividing all counters by two can be done in a few machine instructions
(see the sketch after this list)
- 7-bit or 9-bit counters can be used as well
Cons:
- incrementing a counter is a bit more complex
- sorting is trickier because the counter values will have to be
swapped as well
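
Here is the sketch mentioned above, doit-style (the layout follows the
proposal; the example values are arbitrary):

"The six low bytes hold the six entry counters, the two high bytes the
 housekeeping counter."
| word counterMask halved shift |
word := 16r0001402010080402.  "housekeeping = 1; counters = 64, 32, 16, 8, 4, 2"

"Halving all six counters at once: shift right by one, then mask off the
 bit that bled in from each neighbouring byte; the top bytes are kept."
counterMask := 16r00007F7F7F7F7F7F.
halved := ((word bitShift: -1) bitAnd: counterMask)
    bitOr: (word bitAnd: 16rFFFF000000000000).
"halved = 16r0001201008040201, i.e. counters 32, 16, 8, 4, 2, 1"

"Incrementing the counter of entry i (here i = 3), checking first so the
 increment can never overflow into the neighbouring byte."
shift := (3 - 1) * 8.
((word bitShift: shift negated) bitAnd: 16rFF) < 16rFF
    ifTrue: [word := word + (1 bitShift: shift)].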

Assuming there's no free byte, I'd start straight with option 2, but even
adding a 64-bit word to a PIC doesn't seem too easy.

Levente

[1] This scheme retains information about older hits as well, but the
effect of the past decreases exponentially (a hit from k divisions ago
contributes only 2^-k of a fresh hit to the current count), so it's a
fairly good approximation of the most recent hits without the need to
store any additional data. I came up with this method a few years ago,
yet I still couldn't find out what it's called.

Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Eliot Miranda-2
 
Hi Guys,

On Tue, Mar 15, 2016 at 11:46 AM, Levente Uzonyi <[hidden email]> wrote:
> [...]


Reorganizing 6-element PICs doesn't really make sense.  The cost of the counters would likely outweigh the benefits of reordering the PICs.  Look at the time taken for a monomorphic send vs the worst-case PIC send (6th case) vs a megamorphic send:

| n |
n := 1000000000.
{[| v |
  v := 1.
  1 to: n do: [:i|
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 6 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 7 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]]}
  collect: [:ea| ea timeToRun]  "=> #(2716 7013 7376 8350)"


So on a 2.2GHz MacBook Pro Core i7 the monomorphic send of #species (block 2) takes (7.013 - 2.716) / 1,000,000,000 = 4.3 nanoseconds; the polymorphic send in the 6th case of a PIC (block 3) takes (7.376 - 2.716) / 1,000,000,000 = 4.66 nanoseconds.  Only the open PIC send (full hashed lookup) takes significantly longer: (8.35 - 2.716) / 1,000,000,000 = 5.63 nanoseconds.
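
Equivalently, since each block runs 10^9 iterations and timeToRun answers
milliseconds, nanoseconds per send is simply (t - 2716) / 1000:

#(7013 7376 8350) collect: [:t | (t - 2716) / 1000.0]
    "=> #(4.297 4.66 5.634) nanoseconds per send"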

Let's add a counter; this is probably a lot more expensive than a VM counter implementation, but it gives us a sense of scale.

| n |
n := 1000000000.
{[| v c |
  c := 0. v := 1.
  1 to: n do: [:i|
    c := c + 1.
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do: [:i|
    c := c + 1.
    v species.
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do: [:i|
    c := c + 1.
    v species.
    i < 6 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do: [:i|
    c := c + 1.
    v species.
    i < 7 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]]}
  collect: [:ea| ea timeToRun]  "=> #(3399 7755 8361 9017)"

So a counter adds approximately (#(7755 8361 9017) sum - #(7013 7376 8350) sum) / (#(7013 7376 8350) collect: [:ea| ea - 2716]) sum = 16% overhead, whereas the worst-case polymorphic send is only (7376 - 2716) / (7013 - 2716) asFloat = 1.08, i.e. 8% slower than a monomorphic send.  So the maximum improvement counters could make would be 8%.  Even if their cost fell to 1/4 of the cost above, they would still cost 50% of the maximum return they could deliver.  It's not worth the complexity; the gains would be in the noise.
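
The same arithmetic as printable doits, using the timings from the two
runs above:

"Counter overhead, relative to the send portion of each run:"
((#(7755 8361 9017) sum - #(7013 7376 8350) sum)
    / (#(7013 7376 8350) collect: [:ea | ea - 2716]) sum) asFloat.
"=> 0.164, i.e. ~16%"

"Worst-case (6th entry) PIC send relative to a monomorphic send:"
((7376 - 2716) / (7013 - 2716)) asFloat.
"=> 1.084, i.e. ~8% slower"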


<N.B.>  The above PIC sends contain an error in my methodology.  Tim redesigned PICs late last year so that a two-element PIC does a test and then jumps to the 6th case to perform the next test.  If the PIC is extended, that jump is modified to jump to the 5th case, the next extension changes it to the 4th case, and so on.  So in fact, to measure the cost of a send to the 6th element of the PIC we need to change v back to have the class of the 2nd case:

[| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 7 ifTrue: [v := #(1.0 #one 'one' #[1]. 1.0. $1. #(1)) at: i]]].

If you try this the above numbers don't change significantly, so I've been lazy and haven't redone my analysis, but here's an example run:

| n |
n := 1000000000.
{[| v |
  v := 1.
  1 to: n do: [:i|
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 1 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 7 ifTrue: [v := #(1.0 #one 'one' #[1]. 1.0. $1. #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do: [:i|
    v species.
    i < 7 ifTrue: [v := #(1.0 #one 'one' #[1]. $1. #(1)) at: i]]]}
  collect: [:ea| ea timeToRun]  "=> #(2778 7370 7683 8683)"
</N.B.>


Some remarks.

Counters are expensive.  Sista already goes to some effort to minimise their cost: we count conditional branches instead of full sends.  The Green Book presents measurements on the Smalltalk-80 V2 image of 1983 showing that conditional-branch bytecodes (as in inlined ifTrue:ifFalse:, and:, to:do: et al) are about six times less frequent than send bytecodes, and in today's code that frequency has likely fallen a little further as people write more OO code.

Architectural solutions such as Spur and Sista offer much more substantial gains; Spur shows roughly a 40% speedup across a range of benchmarks.  I expect Sista will deliver more once we are more aggressive with register allocation and the range of optimizations we implement, although right now we're seeing gains similar to Spur's.  My expectation is based both on the success of other adaptive optimizers and on my experience in tuning OO systems, where the first implementation is almost universally slower than the one it replaces, the full gains coming only after careful performance measurement and tuning; Sista is still very young.

_,,,^..^,,,_
best, Eliot

Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Bert Freudenberg
 
On 15.03.2016, at 20:28, Eliot Miranda <[hidden email]> wrote:
>  #(1.0 #one 'one' #[1]. $1. #(1))

#(1.0 #one 'one' #[1]. $1. #(1)) size
==> 8

(not 6 as intended)

- Bert -





Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Eliot Miranda-2
 

On Tue, Mar 15, 2016 at 1:15 PM, Bert Freudenberg <[hidden email]> wrote:
> [...]

Thanks, Bert.  If one corrects the array the numbers don't change significantly, so the figures are still representative.  Reproducible times on a Mac would be nice to have... ;-)  I can get two congruent runs, but timing is all over the place by the third run :-(

_,,,^..^,,,_
best, Eliot

Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Clément Béra
 
Hello,

I was considering counters in the PICs not necessarily to reorder the PIC, but also to make the frequency information for each type available to the Sista optimizer, in a similar way to the Dart VM, where a lot of profiling information is available. Two use cases come to mind: specializing the code only for the most frequently used type and compiling the fallback for the least recently used types, or generating optimized code with prefilled inline caches, including PICs in the correct order. Obviously that's quite some work, so we have to measure what we would really gain.

As Eliot mentioned, the counters cannot be inlined in the PIC, as each time one modifies the machine code the CPU's instruction cache line is flushed, so inlined counters imply a huge slowdown. The divide-by-2 idea sounds cool though. I was also thinking that the housekeeping could happen during machine-code garbage collection: since the PIC is moved anyway, reordering at that time might be cheaper.

2016-03-16 2:19 GMT+01:00 Eliot Miranda <[hidden email]>:
> [...]


Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Ben Coman

On Wed, Mar 16, 2016 at 10:50 AM, Clément Bera <[hidden email]> wrote:
> [...]

Just some wild speculation.  I don't really need a reply; I just want to
say these out loud so I can move on.

* If the data needs to be stored away from the PIC, then maybe it can
even be queued for processing by a second CPU. A small gain may be
possible, in spite of the main VM not being multi-processor.

* An obvious thing would be to run expensive counters only during unit
tests, but I guess that workload might differ from real life.

* I also wonder if such a counting mechanism might be re-purposed for
profiling execution time (except that existing solutions are sufficient).

cheers -ben


> [...]

Re: Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Eliot Miranda-2

Hi Ben, Hi Levente,

> On Mar 15, 2016, at 11:09 PM, Ben Coman <[hidden email]> wrote:
> [...]
> * If the data needs to be stored away from the PIC, then maybe it can
> even be queued for processing by a second CPU. A small gain may be
> possible, in spite of the main VM not being multi-processor.

At least on x86/x86_64 it's essential to keep the counters far away from the code, because the read-modify-write cycle of the counter update flushes the processor's internal CISC-to-RISC JIT instruction cache and results in truly abysmal performance.
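
For illustration, a minimal image-side sketch of what "far away" means:
counters live in plain data memory (here a Bitmap) indexed by a
hypothetical per-PIC id, so a hit is an ordinary data write that never
touches the code pages. The Bitmap and the id scheme are assumptions for
the sketch, not actual Cog structures.

| counters picId |
counters := Bitmap new: 1024.  "one 32-bit counter slot per PIC, in data memory"
picId := 42.                   "hypothetical id assigned when the PIC is created"
"On a cache hit: a read-modify-write of data memory only; the
 instruction cache is never invalidated."
counters at: picId put: (counters at: picId) + 1.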

> [...]