Hi Ronie,

On Tue, Jan 21, 2020 at 5:48 PM Ronie Salgado <[hidden email]> wrote:
perform:, perform:with: et al (primitive 83), but not perform:withArguments:, are completely implemented in machine code, with no transition to C unless the method is not in the first-level method lookup cache. See genPrimitivePerform, genLookupForPerformNumArgs: and adjustArgumentsForPerform:.
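For instance (a sketch; the selectors are ordinary ones, and the primitive numbers are as in stock Squeak/Pharo images):

	3 perform: #factorial.                    "perform: - primitive 83, stays in machine code"
	3 perform: #+ with: 4.                    "perform:with: - primitive 83, stays in machine code"
	3 perform: #between:and: with: 1 with: 5. "perform:with:with: - primitive 83, stays in machine code"
	3 perform: #+ withArguments: #(4)         "perform:withArguments: - primitive 84, transitions to C"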
Right. There is only the first-level method lookup cache, so it has interpreter-like performance. The selector and class of the receiver have to be hashed and the first-level method lookup cache probed. Way slower than block activation. I will claim though that Cog/Spur OpenSmalltalk's JIT perform implementation is as good as or better than any other Smalltalk VM's. IIRC VW only machine-codes perform: and perform:with:.
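Schematically, the probe looks like this (a workspace sketch of the idea only; the real cache lives in the VM, and its hash function, size and entry layout differ):

	| cacheSize cache selector class hash entry |
	cacheSize := 1024.
	cache := Array new: cacheSize.	"entries would be {selector. class. target method} triples"
	selector := #benchFib:.
	class := SmallInteger.
	"hash the selector together with the receiver's class, then probe a single slot"
	hash := (selector identityHash bitXor: class identityHash) \\ cacheSize + 1.
	entry := cache at: hash.
	(entry notNil and: [entry first == selector and: [entry second == class]])
		ifTrue: [entry third]	"hit: run the cached target directly"
		ifFalse: ['miss: full method lookup, then refill this slot']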
_,,,^..^,,,_
best, Eliot
Eliot,

> > 2) the lack of inline caches for #perform: (again, I am just guessing in
> > this case).
>
> Right. There is only the first-level method lookup cache, so it has
> interpreter-like performance. The selector and class of the receiver have to
> be hashed and the first-level method lookup cache probed. Way slower than
> block activation. I will claim though that Cog/Spur OpenSmalltalk's JIT
> perform implementation is as good as or better than any other Smalltalk
> VM's. IIRC VW only machine-codes perform: and perform:with:.

Do you have a benchmark for perform: et al.? I'd be quite interested. Last time I was on this topic, I struggled to come up with a benchmark that would plausibly stand in for a real workload (and whose results I could interpret :-)

Jan
Hi Jan,

On Thu, Jan 23, 2020 at 2:35 AM Jan Vrany <[hidden email]> wrote:

Well, how about these? Scroll down past the definitions to see the benchmarker. The point about benchFib is that 1 is added for every activation, so the result is the number of calls required to evaluate it. Hence divide by the time and one gets activations per second. Very convenient. The variations are: a method using block recursion; a method on Integer where the value is accessed as self; a method using perform:; and two methods that access the value as an argument, one with a SmallInteger receiver and the other with nil as the receiver.

!BlockClosure methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:55'!
benchFib: arg
	| benchFib |
	benchFib := [:n|
		n < 2
			ifTrue: [1]
			ifFalse: [(benchFib value: n - 1) + (benchFib value: n - 2) + 1]].
	^benchFib value: arg! !

!Integer methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:10'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !

!Integer methodsFor: 'benchmarks' stamp: 'jm 11/20/1998 07:06'!
benchFib
	^self < 2
		ifTrue: [1]
		ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]! !

!Symbol methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 22:57'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self perform: #benchFib: with: n - 1)
				+ (self perform: #benchFib: with: n - 2) + 1]! !

!UndefinedObject methodsFor: 'benchmarks' stamp: 'eem 1/23/2020 23:09'!
benchFib: n
	^n < 2
		ifTrue: [1]
		ifFalse: [(self benchFib: n - 1) + (self benchFib: n - 2) + 1]! !

The collector gathers { milliseconds. result. result / seconds }. Bigger is faster (more calls per second). Using Integer receivers involves a branch in the inline cache check, whereas all the others have no such jump. This is a 64-bit Squeak 5.2 image on my 2.9 GHz Intel Core i9 15" 2018 MacBook Pro (thanks Doru!). And I'm using the SistaV1 bytecode set with full blocks (no block dispatch to reach the code for a particular block; each block is its own method).

| n collector blocks times |
n := 42.
collector := [:block| | t r |
	t := [r := block value] timeToRun.
	{ t. r. (r * 1000.0 / t) rounded }].
blocks := {	[n benchFib].
			[n benchFib: n].
			[nil benchFib: n].
			[#benchFib: benchFib: n].
			[[] benchFib: n] }.
times := blocks collect: collector; collect: collector. "twice to ensure measuring hot code"
(1 to: blocks size) collect: [:i|
	{ (blocks at: i) decompile }, (times at: i), { ((times at: i) last / times first last * 100) rounded }]

{{{ [n benchFib]} . 3734 . 866988873 . 232187700 . 100 } .
 {{ [n benchFib: n]} . 3675 . 866988873 . 235915340 . 102 } .
 {{ [nil benchFib: n]} . 3450 . 866988873 . 251301123 . 108 } .
 {{ [#benchFib: benchFib: n]} . 5573 . 866988873 . 155569509 . 67 } .
 {{ [[] benchFib: n]} . 4930 . 866988873 . 175859812 . 76 }}

So...
- the clock is very granular (you see this at low n).
- blocks are 76% as fast as straight integers.
- perform: is 67% as fast as straight integers (not too shabby; but then integers are crawling).
- fastest is sending to a non-immediate receiver and accessing the value as an argument.

The rest indicates that frame building is really expensive and dominates the differences between accessing the value as the receiver or as an argument, whether there's a jump in the inline cache check, etc. This confirms what we found many years ago: if the ifTrue: [^1] branch can be done frameless, or if significant inlining can occur (as an adaptive optimizer can achieve), then things go a lot faster. But on the Cog execution architecture blocks and perform: are p.d.q. relative to vanilla sends.
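A quick sanity check of the counting trick, and of the 866988873 in every row's second column (a workspace sketch; since each activation adds exactly 1, the result obeys count(n) = count(n-1) + count(n-2) + 1, which solves to 2 * fib(n+1) - 1):

	| count a b |
	count := [:n| n < 2 ifTrue: [1] ifFalse: [(count value: n - 1) + (count value: n - 2) + 1]].
	(0 to: 10) collect: count.	"#(1 1 3 5 9 15 25 41 67 109 177)"
	"iterative fib to check the closed form at n = 42 without the 866-million-call recursion"
	a := 0. b := 1.
	43 timesRepeat: [| t | t := a + b. a := b. b := t].
	2 * a - 1	"866988873, the second column in every row above"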
_,,,^..^,,,_
best, Eliot

[Attachment: benchFibs.st (2K)]