[vwnc] VM optimization: code inlining?

[vwnc] VM optimization: code inlining?

Mark Plas

Hi all,

 

This is what the Dictionary>>at: implementation looks like in vw7.4.1:

 

at: key
	"Answer the value at key.  If key is not found, raise a signal."

	^self at: key ifAbsent: [self keyNotFoundErrorFor: #at: index: key]

 

The block argument to ifAbsent: is a copying block because it references self and key. This means that for each invocation of the #at: method, a copying block is created carrying two copied values. This is unfortunate because most of the time the key will be present when sending #at:, so the ifAbsent: block is never evaluated.
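To illustrate the distinction (a sketch only; the compiler classifies a block by what it references):

"A clean block: references only its own arguments and temporaries,
so a single pre-built block object can be shared by all activations."
[:x | x * 2]

"A copying block: references self and the method argument key, so
every activation of #at: must allocate a block holding copies of both."
[self keyNotFoundErrorFor: #at: index: key]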

 

To work around this I added a shared variable called #UndefinedValue on the Dictionary class and initialized it with Object new.

This allows me to change the #at: method to this:

 

 

at: key
	"Answer the value at key.  If key is not found, raise a signal."

	| result |
	result := self at: key ifAbsent: [UndefinedValue].
	result == UndefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
	^result

 

In this implementation there is no copying block, and when I run some benchmarks with it, it completes in 75% of the original time. There are other methods on Dictionary that can be optimized in the same way, so it might be interesting to introduce this kind of optimization in the base product.
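For instance, #removeKey: could get the same marker treatment (a sketch only; I'm assuming the stock method signals the error from an ifAbsent: copying block just as #at: does):

removeKey: key
	"Remove key and answer its value. If key is absent, raise a signal,
	without allocating a copying block on the common present-key path."
	| result |
	result := self removeKey: key ifAbsent: [UndefinedValue].
	result == UndefinedValue ifTrue: [self keyNotFoundErrorFor: #removeKey: index: key].
	^result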

 

But this was not the reason for the mail. It made me think that in many cases copying blocks (or even full blocks) are not needed. In the given example there is no need to make a copying block: when the block gets evaluated, the calling stack frame still exists, so the block could simply have used the stack variables instead of copying their values locally.

 

It would be very interesting to see the VM evolve in such a way that it can detect these cases. I believe there are techniques available that can achieve this. Strongtalk does it by using method inlining, but perhaps there are other ways to do it? I can imagine that if this found its way into all Smalltalk code, it could offer a significant speedup, because blocks are used pretty much everywhere.

 

I wonder whether Cincom has plans to put these kinds of optimizations into their VM? Apart from this, is there documentation available describing the kinds of optimizations that happen in the VisualWorks VM?

 

Thanks,

Mark

 



Re: [vwnc] VM optimization: code inlining?

Terry Raymond

Hmm, this made me think.

 

Just as a typical curly-brace compiler has switches for optimization, suppose we had something similar. I bet that if method inlining were done at the bytecode level, we would see a reasonable performance improvement for some benchmarks and applications.

Doing it at the bytecode level would have minimal impact on the code base and would probably be fairly easy to do. There would be a little additional work so that the debugger is aware of the change.

 

Terry

===========================================================
Terry Raymond
Crafted Smalltalk
80 Lazywood Ln.
Tiverton, RI  02878
(401) 624-4517      [hidden email]
<http://www.craftedsmalltalk.com>
===========================================================



Re: [vwnc] VM optimization: code inlining?

Steven Kelly
In reply to this post by Mark Plas

Is native code cached at the granularity of CompiledMethod or of BlockClosure? I think your idea would only be useful for the latter, and only if the inlining were restricted to whole BlockClosures, and if the native code caching worked so that each inlined BlockClosure mapped to the same native code block.

 

Steve

 


Re: [vwnc] VM optimization: code inlining?

Terry Raymond

Steve

 

Well, I guess I should provide more info.

 

What I was thinking of is how the RB (Refactoring Browser) does inlining.

 

If #at:ifAbsent: is defined as

at: aKey ifAbsent: aBlock

	| index obj |
	index := self findKeyOrNil: aKey.
	obj := self basicAt: index.
	^obj == nil
		ifTrue: aBlock		"<- not really how it is written"
		ifFalse: [obj value]

then inlining #at: would turn it into

at: key

	| index obj |
	index := self findKeyOrNil: key.
	obj := self basicAt: index.
	^obj == nil
		ifTrue: [self keyNotFoundErrorFor: #at: index: key]
		ifFalse: [obj value]

 

I think the inliner could be made smarter, so that it could transform block arguments into the #ifTrue:ifFalse: form; they would then no longer require a copying block.

Terry




Re: [vwnc] VM optimization: code inlining?

Alan Knight-2
I find it remarkable that, with everything that goes on in a dictionary lookup, the cost of a copying versus a clean block would be 25% of the time. I'm inclined to be suspicious of the benchmark.


--
Alan Knight [|], Engineering Manager, Cincom Smalltalk


Re: [vwnc] VM optimization: code inlining?

Mark Plas
In reply to this post by Terry Raymond

Hi Terry,

 

How would you know when to begin inlining a message send? You'd need to keep track of which parts of the code are 'hot spots'. Would the VM do this for you and send back notification messages? I also think you'd need to specialize a lot of methods. In the case of the #at: example, you cannot simply inline the #at:ifAbsent: code, because it is overridden several times in the subclasses of Dictionary; you'd have to duplicate the optimized version of #at: for each subclass that overrides #at:ifAbsent:. And if you do this at the Smalltalk level, a developer would all of a sudden see methods appearing and disappearing in his classes.

 

Or are you speaking of method inlining at compile time?

 

Mark

 



Re: [vwnc] VM optimization: code inlining?

Mark Plas
In reply to this post by Alan Knight-2

Hi Alan,

 

My test is at the end of this mail.

 

It was not so much a test of the performance of #at: in a Dictionary; I was trying to see what the overhead of a copying block is compared to a clean block, and I tested it by modifying the #at: method. The reason for posting here was to find out whether anything is going on at Cincom concerning VM performance that could address avoiding copying blocks when they're not needed.

 

Mark

 

Here’s the test:

 

|d|
d := IdentityDictionary new.
d at: #ababa put: 1.
d at: #abcacba put: 2.
d at: #wwabwaba put: 3.
d at: #tahhbg put: 4.
Time millisecondsToRun:
	[1000000 timesRepeat:
		[d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
		d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg]]

 



[vwnc] OrderedCollection speedup

Henrik Sperre Johansen
In reply to this post by Terry Raymond
The inlining seems like fun. The example got me thinking, though, of another Collection method I'd like to see go faster, which again led me to thinking about removeAll:, which again meant I was in a bad place for the rest of the day. Boo!

Just so the day isn't entirely wasted, here's the one it got me thinking about:

OrderedCollection >> removeAllSuchThat: aBlock
	"Evaluate aBlock for each element of the receiver.  Remove each
	element for which aBlock evaluates to true."

	| element newCollection index removed |
	newCollection := self copyEmpty.
	index := firstIndex.
	removed := 0.
	[index <= lastIndex] whileTrue:
		[element := self basicAt: index.
		(aBlock value: element)
			ifTrue:
				[newCollection add: element.
				self basicAt: index - removed put: (self basicAt: index + 1).
				removed := removed + 1]
			ifFalse:
				[removed > 0
					ifTrue: [self basicAt: index - removed put: (self basicAt: index)]].
		index := index + 1].
	lastIndex - removed + 1 to: lastIndex do:
		[:nilIndex | self basicAt: nilIndex put: nil].
	lastIndex := lastIndex - removed.
	^newCollection

Testing an example in a workspace (the method above was installed as #removeAllSuchThatNew: to compare against the stock #removeAllSuchThat:):

order := OrderedCollection new.
order2 := OrderedCollection new.
1 to: 100000 do: [:i | order add: i. order2 add: i].
Time millisecondsToRun: [order removeAllSuchThat: [:each | each >= 20000 and: [each <= 50000]]].	"31242"
Time millisecondsToRun: [order2 removeAllSuchThatNew: [:each | each >= 20000 and: [each <= 50000]]].	"68"
order = order2	"true"

Cheers,
Henry

PS: Why is there no removeAllSuchThat: for all collections? It replaces the common case 'aCollection removeAll: (aCollection select: [:each | each someThing])' quite nicely, has a simple implementation, and can be optimized much more easily for specific collections than removeAll: can, due to the return-value constraints posed by the already existing VW implementations.

Re: [vwnc] VM optimization: code inlining?

Terry Raymond
In reply to this post by Mark Plas

There are two problems here. The first is identifying which methods should be inlined. The second is what inlining means.

The inlining I am thinking of would use the RB inliner to dynamically inline a method and compile it. The new compiled method would replace the old one, but it would refer back to the original non-inlined source. This does create an additional problem with the debugger's source map. Additionally, the decompiled method would not look at all like the original source.

Identifying which methods to inline, and how deeply to inline them, could be done manually or with a special tool. It all depends on how sophisticated you want to be.

Basically, I am suggesting that the inlining be done in the image, so that the resulting bytecodes produce improved performance, and not in the VM.


Terry


Re: [vwnc] OrderedCollection speedup

Henrik Sperre Johansen
In reply to this post by Henrik Sperre Johansen
To those who care, here's the workspace test for another notoriously "slow to remove from" collection, with removeAllSuchThat: implemented in a similar way:

set := Set withAll: (2500 to: 20000).
set2 := Set withAll: (2500 to: 20000).
Time millisecondsToRun: [set removeAll: (set select: [:each | each >= 7500 and: [each <= 12500]])].	"30523"
Time millisecondsToRun: [set2 removeAllSuchThat: [:each | each >= 7500 and: [each <= 12500]]].	"61"
set asOrderedCollection = set2 asOrderedCollection	"true"
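(The Set version of removeAllSuchThat: isn't shown here; where replacing the collection is acceptable, the same single-pass win is available with public protocol alone:)

"Rebuild in one pass instead of removing element by element,
avoiding repeated rehashing of a degenerating table."
set := set reject: [:each | each >= 7500 and: [each <= 12500]].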

I planned to run them with the same sizes as the OrderedCollection example, but I gave up on the removeAll: after a minute. removeAllSuchThat: finished in 1.5 seconds...

Cheers,
Henry

PS. I played around some more with Pharo after posting, and even Squeak already has this for OrderedCollection. They're missing the equivalent for Set for some reason, though; probably it's not used often.
PPS. Is it just me, or does it bother anyone else that in such a mature technology as VisualWorks I sometimes have to stop my flow of thoughts -> code to think about whether the collection type I want to remove objects from will make performance blow up in my face?


Re: [vwnc] OrderedCollection speedup

Andres Valloud-6
Henrik,

Note that the example you use with sets exacerbates a known limitation of the current implementation. When Set encounters a hash value that could be an identityHash value, it assumes it is one and scales it, so that such values don't cluster at the beginning of the set's table. However, this backfires when most elements in the set have consecutive hash values, because the scaling poisons the hash table and most operations degenerate into linear searches. This can be seen clearly by doing the following:

Time millisecondsToRun: [set removeAll: (set select: [:each | each >= 7500 and: [each <= 12500]])]	"71789"

Time millisecondsToRun: [7500 to: 12500 do: [:each | set remove: each ifAbsent: [nil]]]	"70104"

In other words, the select: takes basically no time, and most of the
effort is spent rehashing a hashed collection that has degenerated into
a linear search collection.

If, for example, we use numbers above 16383, then the situation changes
quite a bit.

set := Set withAll: (16384 to: 32768).

Time millisecondsToRun: [set removeAll: (set select: [:each | each >= 17500 and: [each <= 22500]])]	"6676"

Time millisecondsToRun: [17500 to: 22500 do: [:each | set remove: each ifAbsent: [nil]]]	"6837"

In other words, the time it takes to do the select: is still within
measurement error, but the time it takes to complete an operation of
roughly the same size has decreased by a factor of 11x.  This is because
now Integer>>hash is being left as is and is not scaled by Set.

<digression>
The choice to scale values in a Set is driven by the fact that a simple Set holding many objects that implement hash as identityHash would suffer performance problems without it. However, scaling introduces problems when one is using perfect hash functions, because it can break the assumptions made when designing the implementation of hash. So, at least in my personal opinion, what we could do is introduce another message such as scaledIdentityHash (or whatever) that returns an already scaled identityHash value. Then we could implement Object>>scaledIdentityHash as identityHash bitShift: 14, and reimplement Object>>hash as ^self scaledIdentityHash. This scales identityHash values inside sets, and avoids the issues with perfect hash functions.
</digression>
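Spelled out as code (the selector is tentative, as per the "or whatever" above):

Object >> scaledIdentityHash
	"Answer the identity hash pre-scaled, so identity-hashed objects
	spread across a hashed collection's table instead of clustering."
	^self identityHash bitShift: 14

Object >> hash
	"The default hash delegates to the pre-scaled value, so Set no
	longer has to guess whether a value is an identityHash and scale it."
	^self scaledIdentityHash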

You may want to redo your tests taking this into account.  Of course, if
you want to mass remove objects from a Set, you may want to keep a copy
of removeAllSuchThat: around to further speed up things depending on
usage.

Finally, the selector removeAllSuchThat: sounds like something I could have come up with. However, now I think eject: is much nicer. The original counterpart to eject: was keep:, but as I write this I get the idea that maybe it should be elect:. This makes eject: and elect: be like reject: and select:, except that the receiver is modified as opposed to answering a copy.

Andres.


Re: [vwnc] OrderedCollection speedup

Stefan Schmiedl
On Wed, 8 Oct 2008 18:40:46 -0400
"Valloud, Andres" <[hidden email]> wrote:

> However, now I think eject: is much nicer.

Now I am going to dream of ejected objects dangling from tiny
parachutes and slowly drifting downwards ...

Good thing you didn't choose evict:, because homeless objects would just
make me weep with sorrow.

SCNR
s.
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] OrderedCollection speedup

Michael Lucas-Smith-2
Stefan Schmiedl wrote:

> On Wed, 8 Oct 2008 18:40:46 -0400
> "Valloud, Andres" <[hidden email]> wrote:
>
>  
>> However, now I think eject: is much nicer.
>>    
>
> Now I am going to dream of ejected objects dangling from tiny
> parachutes and slowly drifting downwards ...
>
> Good thing you didn't choose evict:, because homeless objects would just
> make me weep with sorrow.
>
>  
If we have evict:, we'll need a bailout: before too many objects get
evicted.
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Michel Tilman
In reply to this post by Alan Knight-2
Hello,

Out of curiosity I tried (using Mark's benchmark) the following experiments.


1) On my machine (iMac intel OSX / VW7.6) the time for his solution is about 2.8".


2) Adding (the "k" argument is of course redundant here)

==
at: key ifAbsent: aBlock receiver: receiver key: k
| index obj|
index := self findKeyOrNil: key.
obj := self basicAt: index.
^obj == nil
ifTrue: [aBlock value: receiver value: k ]
ifFalse: [obj value]
==

and using

==
at: key 
^ self at: key ifAbsent: [ :rcvr :k | rcvr keyNotFoundErrorFor: #at: index: k ] receiver: self key: key
==

results in about 2.3".

We have a clean block here, and little overhead.


3) Using a more 'generic' variant

==
at: key ifAbsent: aBlock context: context
| index obj|
index := self findKeyOrNil: key.
obj := self basicAt: index.
^obj == nil
ifTrue: [aBlock value: context ]
ifFalse: [obj value]
==

with

==
at: key 
"Answer the value at key.  If key is not found, raise a signal."

^ self at: key ifAbsent: [ :ctxt | ctxt first keyNotFoundErrorFor: #at: index: ctxt last ] context: (Array with: self with: key)
==

takes about 3.6".

Again we have a clean block but added the overhead of dealing with arrays.


4) Adding instance variables "a" and "b" to dictionaries to force the use of copying blocks as follows (and to avoid having to tinker with the compiler):

==
at: key 
a := self.
b := key.
^ self at: key ifAbsent: [ a keyNotFoundErrorFor: #at: index: b ]
==

yields a time of about 3.4".

(We ignore all threading issues here).


5)

If we (could) remove the assignments to "a" and "b" (for the sake of the argument), then execution time drops to about 2.9".


michel


On 08 Oct 2008, at 17:01, Alan Knight wrote:

I find it remarkable that, with everything that goes on in a dictionary lookup, the cost of a copying versus a clean block would be 25% of the time. I'm inclined to be suspicious of the benchmark.

At 09:56 AM 10/8/2008, Terry Raymond wrote:
Steve
 
Well, I guess I should provide more info.
 
What I was thinking of is how the RB does the inlining.
 
If #at:ifAbsent: is defined as
 
at: key ifAbsent: aBlock
 
            | index obj|
            index := self findKeyOrNil: key.
            obj := self basicAt: index.
            ^obj == nil
                         ifTrue: aBlock               <- not really how it is written
                        ifFalse: [obj value]
 
Then inlining of #at: would turn into
 
at: key
           
            | index obj|
            index := self findKeyOrNil: key.
            obj := self basicAt: index.
            ^obj == nil
                         ifTrue:
[self keyNotFoundErrorFor: #at: index: key]
                        ifFalse: [obj value]
 
I think the inliner could be made smarter so it could transform block arguments
into the #ifTrue:ifFalse: form so they would no longer require a copying
block.

Terry

===========================================================
Terry Raymond
Crafted Smalltalk
80 Lazywood Ln.
Tiverton, RI  02878
(401) 624-4517      [hidden email]
< http://www.craftedsmalltalk.com>
===========================================================

From: [hidden email] [[hidden email]] On Behalf Of Steven Kelly
Sent: Wednesday, October 08, 2008 8:50 AM
To: [hidden email]
Subject: Re: [vwnc] VM optimization: code inlining?
 
Is native code cached at the level of granularity of CompiledMethod or BlockClosure? I think your idea would only be useful for the latter, and if the inlining was restricted to whole BlockClosures, and if the native code caching worked so that each inlined BlockClosure mapped to the same native code block.
 
Steve
 
[Terry's original message and Mark's original post snipped - both are quoted verbatim at the start of the thread]

--
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Martin McClure
In reply to this post by Alan Knight-2
Alan Knight wrote:
> I find it remarkable that, with everything that goes on in a dictionary
> lookup, the cost of a copying versus a clean block would be 25% of
> the time. I'm inclined to be suspicious of the benchmark.
>

Rather more than 25%, actually. Benchmarks are often surprising...

The time to run an empty loop:

^Time microsecondsToRun: [10000000 timesRepeat: []]
  30907
  30870
  29836

average: 30538


| dict |
dict := Dictionary new.
dict at: #foo put: #bar.
^Time microsecondsToRun: [10000000 timesRepeat: [dict at: #foo]]
  1123834
  1102892
  1245824

average: 1157517, less loop overhead 1126979
for an average of 8.87 lookups per microsecond.

| dict |
dict := TestDictionary new.
dict at: #foo put: #bar.
^Time microsecondsToRun: [10000000 timesRepeat: [dict at: #foo]]
  734595
  758221
  747092

average: 746636, less loop overhead 716098
for an average of 13.96 lookups per microsecond

That's a *57%* increase in lookup speed.
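
For clarity, the 57% is just the ratio of the two net times above, loop
overhead subtracted:

((1157517 - 30538) / (746636 - 30538) - 1) * 100.0   "about 57.4"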


The difference between Dictionary and TestDictionary is Mark's change,
with the exception of using an instance variable instead of a shared
variable, since shared variable access is quite slow compared to instvar
access.
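
For readers following along, a sketch of what TestDictionary>>at: looks
like under that description; "undefinedMarker" is a hypothetical
instance-variable name, assumed to be initialized once with Object new:

at: key
	"Answer the value at key. If key is not found, raise a signal.
	Since the block references an instance variable, it presumably
	still copies self - one copied value, but no extra Array."
	| result |
	result := self at: key ifAbsent: [undefinedMarker].
	result == undefinedMarker
		ifTrue: [self keyNotFoundErrorFor: #at: index: key].
	^result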

Looking at why there is this much speed difference, the
MultiTimeProfiler tells us:

Dictionary
   92 samples in 1.62 seconds real time
   1452 scavenges, 0 incGCs

TestDictionary
   57 samples in 1.02 seconds real time
   792 scavenges, 0 incGCs

So we're saving a lot of scavenges, which implies we're creating fewer
(or smaller) objects. But we're still creating objects.

The profiler reports for the ScavengeProcess:

Dictionary
   33.0% * 92 samples * 17650 microseconds / sample = 535854
microseconds spent in scavenge.

TestDictionary
   17.3% * 57 samples * 17820 microseconds / sample = 175723
microseconds spent in scavenge.

The difference is 360131 microseconds. So the difference in benchmarks
is almost entirely accounted for by savings in GC.

Let's see if we can reduce scavenges even further.

Test2Dictionary>>at: key

        | index obj |
        index := self findKeyOrNil: key.
        obj := self basicAt: index.
        ^obj == nil
                ifTrue: [self keyNotFoundErrorFor: #at: index: key]
                ifFalse: [obj value]


The profiler now reports that we've eliminated scavenges altogether.
Excellent.

The timings:

| dict |
dict := Test2Dictionary new.
dict at: #foo put: #bar.
^Time microsecondsToRun: [10000000 timesRepeat: [dict at: #foo]]
  585597
  583622
  575339

average: 581519, less loop overhead 550981
for an average of 18.15 lookups per microsecond.

That's a 105% speedup. Test2Dictionary>>at: is more than twice as fast
as Dictionary>>at: in the non-failure case.


A Self-style VM could do this kind of inlining automatically. The
bytecode-level inlining that Terry's talking about could also do it,
though without the frequency feedback it's a bit harder to know where to
do this, and a mechanism would be needed to keep track of all of the
methods that had inlined a particular method in order to automatically
recompile all inlining methods whenever an inlined method is recompiled.
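
To make that bookkeeping concrete, here is a hedged sketch of such a
registry; every name in it (InliningRegistry, dependents,
invalidateAndRecompile) is hypothetical:

InliningRegistry class >>
noteInliningOf: calleeMethod by: callerMethod
	"dependents is a class-side IdentityDictionary mapping a method to
	the set of methods that hold an inlined copy of it."
	(dependents at: calleeMethod ifAbsentPut: [IdentitySet new])
		add: callerMethod

InliningRegistry class >>
calleeRecompiled: calleeMethod
	"Invalidate every caller that inlined the old version of calleeMethod."
	| callers |
	callers := dependents removeKey: calleeMethod ifAbsent: [^self].
	callers do: [:caller | caller invalidateAndRecompile]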

Regards,

-Martin
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Henrik Sperre Johansen
So technically, copying takes just a little bit longer than a clean
block; it's the cleanup that takes a lot of time.
It makes me wonder how much practical impact this would have in a real
situation, considering the amount of garbage you'd create elsewhere
in the course of doing 10 million Dictionary lookups ;)
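
One hedged way to put a number on that concern: interleave each lookup
with some unrelated allocation and see how much of the gap survives.
A sketch only, not measured here:

| dict junk |
dict := Dictionary new.
dict at: #foo put: #bar.
^Time microsecondsToRun:
	[10000000 timesRepeat:
		[junk := Array new: 4.	"unrelated per-iteration garbage"
		dict at: #foo]]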

Cheers,
Henry

Martin McClure wrote:

> [Martin's benchmark analysis quoted in full - snipped; see his message above]
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Eliot Miranda-2
In reply to this post by Martin McClure


On Fri, Oct 10, 2008 at 1:03 PM, Martin McClure <[hidden email]> wrote:
 
[excellent analysis snipped]
 

> That's a 105% speedup. Test2Dictionary>>at: is more than twice as fast
> as Dictionary>>at: in the non-failure case.
>
> A Self-style VM could do this kind of inlining automatically.

Not only that, it would use an at: that avoids the bounds check and the isInteger check, because it would be able to prove that the result of findKeyOrNil: is an index in range.  So the speedups would be even higher than achieved by this experiment.

> The bytecode-level inlining that Terry's talking about could also do it,
> though without the frequency feedback it's a bit harder to know where to
> do this, and a mechanism would be needed to keep track of all of the
> methods that had inlined a particular method in order to automatically
> recompile all inlining methods whenever an inlined method is recompiled.

A Self-style VM can still be implemented to target bytecode and have the VM generate the machine code.  The VM needs additional bytecodes to express things like non-bounds-checking at:, etc.

But you and I know this.  We've been talking about it for a while now.  I'm looking forward to working on this in Cog.

Eliot
Qwaq, Inc.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Michel Tilman
Hello,

I looked again at my small tests (I noticed my previous mail was a bit confusing: the standard Dictionary>>#at: also uses copying blocks, of course). I also tried to get a rough idea of the scavenges using the AllocationProfiler:

1) standard #at:    about 6.2" - about 5800 scavenges and 1 incGC
2) Mark's #at:    about 2.8" - 0 scavenges and 0 incGC
3) variant on the standard (with the b instance variable, as below): about 3.2" - about 3170 scavenges and 1 incGC

at: key 
"Answer the value at key.  If key is not found, raise a signal."

b := key.
^ self at: key ifAbsent: [ self keyNotFoundErrorFor: #at: index: b ]
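
For reference, a sketch of how scavenge counts like those above can be
gathered; this assumes AllocationProfiler accepts a block via #profile:
the way TimeProfiler does, which should be checked against the loaded
Profiling parcel:

| d |
d := Dictionary new.
d at: #foo put: #bar.
AllocationProfiler profile:
	[1000000 timesRepeat: [d at: #foo]]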

In the first case the copiedValues field is an array with two elements: the dictionary (self) and the key. In the third case the copiedValues field only holds the dictionary, so it can be stored directly, without an extra array object. So I was curious how many occurrences of #makeCopyingBlock:count: there are, and I grouped them by count (standard vw 7.6 + some additional classes). Case 3 falls in the first group; case 1 in the second group.

count  occurrences
    1         5093
    2         2401
    3          842
    4          253
    5           82
    6           25
    7           16
    8            5
    9            2
   10            3
   11            1
   13            1
   17            1

Would it make sense to try to store the copied values in another way?
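
One hedged reading of that question: counts 1 and 2 cover roughly 86% of
the sites above (7494 of 8725), so the compiler and VM could give the
common small counts specialized copying-block shapes with named slots,
and fall back to an Array only for the rare large counts. Purely
illustrative - no such classes exist in VW today, and the superclass
path is a guess:

Smalltalk defineClass: #CopyingBlock2
	superclass: #{Core.BlockClosure}
	indexedType: #none
	private: false
	instanceVariableNames: 'copied1 copied2 '
	classInstanceVariableNames: ''
	imports: ''
	category: 'Sketches'

A CopyingBlock1 would carry a single copied1 slot; counts of 3 and up
would keep using the copiedValues Array, as today.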

Regards,

michel


On 11 Oct 2008, at 00:38, Eliot Miranda wrote:



[Eliot's message quoted in full - snipped; see above]


_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Joachim Geidel
> I looked again at my small tests (I noticed my previous mail was a bit
> confusing, the Dictionary>>#at: method also uses copying blocks of
> course). I also tried to have a rough idea of scavenges using
> AllocationProfiler:
>
> 1) standard #at:    about 6.2" - about 5800 scavenges, and 1 incGC
> 2) Mark's #at:    about 2,8" - 0 scavenges and 0 incGCC
> 3) variant on standard (with b instance variable as below): about 3.2"
> - about 3170 scavenges and 1 incGC

Reading this thread, I have mixed feelings. Yes, it is a good idea to avoid
garbage. BUT (and I meant to write this in capital letters ;-) ) the large
performance differences probably depend on the configuration of ObjectMemory
as well. Given that the default values are still relatively small, it might
be interesting to see the effect of setting the size of Eden and
SurvivorSpace to larger values. This should reduce the number of scavenges.
E.g., setting the sizes of Eden and SurvivorSpace to 10 or 20 times the
default value should reduce the number of scavenges by a factor of 10 or 20.
It might also reduce tenuring objects in OldSpace, thus avoiding the incGC
runs as well.

Would the difference in execution times still be as large, or would it
shrink to less impressive values? If garbage collection (scavenges in this
case) is expensive, one can either avoid it by not creating garbage, or
make it more efficient. So far, the proposed solutions have concentrated
only on the first of those two options.

Having said that, the most efficient solution will of course be a
combination of both.
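
A hedged sketch of that experiment follows; the ObjectMemory interface
differs between VW releases, so treat the selector and the slot layout
as assumptions to verify against the Memory Management appendix for your
release:

"Hypothetical: request roughly 10x the default Eden and SurvivorSpace
sizes at the next image startup, then rerun the benchmarks and compare
scavenge counts."
ObjectMemory sizesAtStartup: #(10.0 10.0 1.0 1.0 1.0 1.0 1.0)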

Joachim Geidel


_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Reply | Threaded
Open this post in threaded view
|

Re: [vwnc] VM optimization: code inlining?

Mark Plas
In reply to this post by Michel Tilman

 

To Martin: You wrote “since shared variable access is quite slow compared to instvar access”, so I changed my method to use an instvar, but it was a bit slower than the version with the shared variable. What would be the reason for this?

 

Then I also did some experiments where I inlined the at:ifAbsent: into the #at: method, avoiding a message send and a copying block:

 

at: key
	| index obj |
	index := self findKeyOrNil: key.
	obj := self basicAt: index.
	^obj == nil
		ifTrue: [self keyNotFoundErrorFor: #at: index: key]
		ifFalse: [obj value]

 

and then increasingly inlined some more messages to finally come to something like this:

 

at: key
	| index obj length pass probe aHashValue |
	length := self basicSize.
	pass := 1.
	aHashValue := key hash.
	index := (length > 4095 and: [aHashValue <= 16383])
		ifTrue: [aHashValue * (length // 4095 + 1) \\ length + 1]
		ifFalse: [aHashValue \\ length + 1].
	[(probe := self basicAt: index) == nil or: [probe key = key]]
		whileFalse:
			[(index := index + 1) > length
				ifTrue:
					[index := 1.
					pass := pass + 1.
					pass > 2 ifTrue: [index := self grow findKeyOrNil: key]]].
	obj := self basicAt: index.
	^obj == nil
		ifTrue: [self keyNotFoundErrorFor: #at: index: key]
		ifFalse: [obj value]

 

I’ve summarized the results (benchmarks in VW7.6; the benchmark code is listed at the end of the mail; the memory size has been set to 128MB):

 

1) Standard Dictionary>>at: implementation: 6.3 sec
2) Using a shared variable to avoid the copying block: 4.2 sec
3) Using an instvar to avoid the copying block: 4.3 sec
4) Using Michel’s technique of stack variables to avoid copying blocks: 4.0 sec
5) Inlining the at:ifAbsent: into the at: method: 3.4 sec
6) Inlining #findKeyOrNil: into the previous method: 3.1 sec
7) Inlining #initialIndexFor:boundedBy: into the previous: 3.0 sec
8) Inlining the constant values into the previous: 2.8 sec

 

 

The difference between nr 4 & 5 is remarkable: 15% faster by avoiding ONE message send.

 

The difference between the standard and the final inlined version (nr 8) is a speedup of 2.25x. There seems to be quite some overhead associated with a message send (going from step 5 to step 8 only removed message sends; no extra copying blocks were removed).

 

I still used my original benchmark, not taking loop overhead into account:

 

|d|
d := Dictionary new.
d at: #ababa put: 1.
d at: #abcacba put: 2.
d at: #wwabwaba put: 3.
d at: #tahhbg put: 4.
Time millisecondsToRun: [1000000 timesRepeat:
	[d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg.
	d at: #ababa. d at: #abcacba. d at: #wwabwaba. d at: #tahhbg]]

 

Mark

 


From: [hidden email] [mailto:[hidden email]] On Behalf Of Michel Tilman
Sent: Friday, 10 October 2008 16:17
To: [hidden email]
Subject: Re: [vwnc] VM optimization: code inlining?

 

[Michel's message and the earlier thread history quoted in full - snipped; see above]
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
12