Re: [Pharo-dev] Fed up


Re: [Pharo-dev] Fed up

Eliot Miranda-2
 
Hi Ronie,

On Tue, Jan 21, 2020 at 5:48 PM Ronie Salgado <[hidden email]> wrote:
That performance regression looks more like a language implementation bug than a problem of the language itself. If we assume that #do: and some other selectors (e.g. #select:, #reject:) should always receive a block instead of a symbol, then the compiler could perfectly well replace the symbol literal with an already instantiated block literal, transparently from the user's perspective. If the compiler does that, we save the bytecode for instantiating the block closure, which can save a potential round trip to the garbage collector. I guess (I am just speculating) that this performance overhead must be in the implementation of the #perform: primitive, which I guess has to:
1) go through the JIT-stack-to-C-stack transition (saving/restoring interpreter/JIT state, additional pressure, and primitive activation overhead).

perform:, perform:with: et al. (primitive 83), but not perform:withArguments:, are completely implemented in machine code, with no transition to C unless the method is not in the first-level method lookup cache.  See genPrimitivePerform, genLookupForPerformNumArgs: and adjustArgumentsForPerform:.
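For concreteness, a sketch of the variants in question (the receivers and selectors are arbitrary examples; the comments restate the behaviour just described):

3 perform: #factorial.                "machine coded (primitive 83)"
3 perform: #max: with: 4.             "machine coded (primitive 83)"
3 perform: #max: withArguments: #(4). "not machine coded; takes the slower path"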

2) the lack of inline caches for #perform: (again, I am just guessing in this case).

Right.  There is only the first-level method lookup cache, so it has interpreter-like performance.  The selector and class of the receiver have to be hashed and the first-level method lookup cache probed.  Way slower than block activation.  I will claim though that Cog/Spur OpenSmalltalk's JIT perform: implementation is as good as or better than any other Smalltalk VM's.  IIRC VW only machine coded/codes perform: and perform:with:.
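To get a feel for the gap from inside an image, a rough sketch (not a rigorous benchmark; absolute numbers vary by VM and machine, and the clock is coarse):

| blk |
blk := [ :x | x + 1 ].
[ 1 to: 10000000 do: [ :i | blk value: i ] ] timeToRun.          "block activation"
[ 1 to: 10000000 do: [ :i | i perform: #+ with: 1 ] ] timeToRun. "a first-level-cache lookup per send"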

Note that the OpalCompiler currently inlines some messages, such as #ifTrue:ifFalse:, #ifNil:ifNotNil:, #and: and #or:, so these are not actual message sends; adding a list of selectors whose literal symbol arguments are synthesized into blocks is no different from the cheating the current compiler is already doing. If a user wants to disable this inlining, there is currently a pragma for telling the compiler.
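For illustration, the proposed synthesis amounts to a source-to-source rewrite (a sketch; today the symbol form only works because Symbol implements #value: by perform:ing itself):

"what the user writes"
aCol do: #store.
"what the compiler would compile, exactly as if the user had written"
aCol do: [ :each | each store ].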

Do you want me to propose an experimental patch for testing this infrastructure?

Best regards,
Ronie

On Tue, Jan 21, 2020 at 7:56 PM, Sven Van Caekenberghe (<[hidden email]>) wrote:
I also like the use of symbols but more for casual code.

I know using blocks can be faster, but the difference is not massive.

What I don't understand is why it is so super bad. Polymorphism will always be there; that is really powerful when used wisely. I can't immediately see why one or the other would make analysis easier or better. Can you explain?

> On 21 Jan 2020, at 23:37, Sean P. DeNigris <[hidden email]> wrote:
>
> ducasse wrote
>> in Pharo we should write     
>>      aCol do: [ :each | each store ]
>
> I always enjoyed the Symbol/Block polymorphism because I thought it was such
> a clever and visible example of the power of Smalltalk, and, let's face it,
> I'm lazy and enjoyed saving a few key strokes!
>
> That said, I had no idea that there was a dramatic performance cost. Also,
> the issues you raise about analysis seem important.
>
> Since people are free to still use it in their own projects, it doesn't seem
> too controversial. Can/should we add a lint rule? Can/should it be scoped to
> Pharo core?
>
>
>
> -----
> Cheers,
> Sean
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>
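As an aside on the lint-rule question above, a minimal sketch of the check such a rule would make, as a workspace expression (the AST-walking details here are assumptions about Pharo's reflective API and may need adjusting; OrderedCollection merely stands in for any class to scan):

| offenders |
offenders := OrderedCollection methods select: [ :method |
	method ast allChildren anySatisfy: [ :node |
		node isMessage
			and: [ (#(do: collect: select: reject: detect:) includes: node selector)
			and: [ node arguments first isLiteralNode
			and: [ node arguments first value isSymbol ] ] ] ] ].
"offenders now holds the methods that pass a literal symbol where a block is expected"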




--
_,,,^..^,,,_
best, Eliot

Re: [Pharo-dev] Fed up

Jan Vrany
 
Eliot,

> > 2) the lack of inline caches for #perform: (again, I am just guessing in
> > this case).
>
> Right.  There is only the first-level method lookup cache so it has
> interpreter-like performance.  The selector and class of the receiver have to
> be hashed and the first-level method lookup cache probed.  Way slower than
> block activation.  I will claim though that Cog/Spur OpenSmalltalk's JIT
> perform: implementation is as good as or better than any other Smalltalk
> VM's.  IIRC VW only machine coded/codes perform: and perform:with:

Do you have a benchmark for perform: et al.? I'd be quite interested.
Last time I was on this topic, I struggled to come up with a benchmark
that would resemble any real-world workload (and whose results I could
interpret :-)

Jan


Re: [Pharo-dev] Fed up

Eliot Miranda-2
 
Hi Jan,

   well, how about these?  Scroll down past the definitions to see the benchmark driver.  The point about benchFib is that 1 is added for every activation, so the result is the number of calls required to evaluate it.  Hence divide by the time and one gets activations per second.  Very convenient.  The variations are a method using block recursion, a method on Integer where the value is accessed as self, a method using perform:, and two methods that access the value as an argument, one with a SmallInteger receiver and the other with nil as the receiver.
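As a worked example of that arithmetic, using the first row of the results below: benchFib: 42 makes 866,988,873 calls, and at 3,734 milliseconds that is 866988873 / 3.734 ≈ 232,187,700 activations per second.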


!BlockClosure methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:55'!
benchFib: arg
	| benchFib |
	benchFib := [:n| n < 2
		ifTrue: [1]
		ifFalse: [(benchFib value: n - 1) + (benchFib value: n - 2) + 1]].
	^benchFib value: arg! !

!Integer methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:10'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !

!Integer methodsFor: 'benchmarks' stamp: 'jm 11/20/1998 07:06'!
benchFib
	^self < 2
		ifTrue: [1]
		ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]! !

!Symbol methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:57'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self perform: #benchFib: with: n - 1) + (self perform: #benchFib: with: n - 2) + 1]! !

!UndefinedObject methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:09'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !


The script collects result / seconds; bigger is faster (more calls per second).  Using Integer receivers involves a branch in the inline cache check, whereas all the others have no such jump.  This is a 64-bit Squeak 5.2 image on my 2.9 GHz Intel Core i9 15" 2018 MacBook Pro (thanks Doru!).  And I'm using the SistaV1 bytecode set with full blocks (no block dispatch to reach the code for a particular block; each block is its own method).

| n collector blocks times |
n := 42.
"collector evaluates a block and answers { milliseconds. result. calls per second }"
collector := [:block| | t r | t := [r := block value] timeToRun. { t. r. (r * 1000.0 / t) rounded }].
blocks := { [n benchFib]. [n benchFib: n]. [nil benchFib: n]. [#benchFib: benchFib: n]. [[] benchFib: n] }.
times := blocks collect: collector; collect: collector. "twice to ensure measuring hot code"
"answer, for each block: its decompiled source, time, result, calls/sec, and percentage relative to the first"
(1 to: blocks size) collect: [:i| { (blocks at: i) decompile }, (times at: i), {((times at: i) last / times first last * 100) rounded }]

{{{ [n benchFib]} . 3734 . 866988873 . 232187700 . 100 } .
 {{ [n benchFib: n]} . 3675 . 866988873 . 235915340 . 102 } .
 {{ [nil benchFib: n]} . 3450 . 866988873 . 251301123 . 108 } .
 {{ [#benchFib: benchFib: n]} . 5573 . 866988873 . 155569509 . 67} .
 {{ [[] benchFib: n]} . 4930 . 866988873 . 175859812 . 76 }}

So... the clock is very granular (you see this at low n).
Blocks are 76% as fast as straight integers.
perform: is 67% as fast as straight integers (not too shabby; but then integers are crawling).
Fastest is sending to a non-immediate receiver and accessing the value as an argument.
The rest indicates that frame building is really expensive and dominates the differences between accessing the value as the receiver or as an argument, whether there's a jump in the inline cache check, etc.  This confirms what we found many years ago: if the ifTrue: [^1] branch can be done framelessly, or if significant inlining can occur (as an adaptive optimizer can achieve), then things go a lot faster.  But on the Cog execution architecture blocks and perform: are p.d.q. relative to vanilla sends.




--
_,,,^..^,,,_
best, Eliot

Attachment: benchFibs.st (2K)