The Inbox: Kernel-cmm.1198.mcz


The Inbox: Kernel-cmm.1198.mcz

commits-2
Chris Muller uploaded a new version of Kernel to project The Inbox:
http://source.squeak.org/inbox/Kernel-cmm.1198.mcz

==================== Summary ====================

Name: Kernel-cmm.1198
Author: cmm
Time: 23 November 2018, 11:12:47.414703 pm
UUID: fe228ca8-2ec7-4432-b3d9-76da98be4475
Ancestors: Kernel-eem.1197

- Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxies can be supported.
- If so, then #xxxClass can be banished.
- With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".

=============== Diff against Kernel-eem.1197 ===============

Item was added:
+ ----- Method: Object>>basicClass (in category 'class membership') -----
+ basicClass
+ "Primitive. Answer the object which is the receiver's class. Essential. See
+ Object documentation whatIsAPrimitive."
+
+ <primitive: 111>
+ self primitiveFailed!

Item was changed:
  ----- Method: Object>>class (in category 'class membership') -----
  class
+ "Answer the object which is the receiver's class. Essential."
- "Primitive. Answer the object which is the receiver's class. Essential. See
- Object documentation whatIsAPrimitive."
 
+ ^ self basicClass!
- <primitive: 111>
- self primitiveFailed!

Item was changed:
  ----- Method: Object>>storeDataOn: (in category 'objects from disk') -----
  storeDataOn: aDataStream
  "Store myself on a DataStream.  Answer self.  This is a low-level DataStream/ReferenceStream method. See also objectToStoreOnDataStream.  NOTE: This method must send 'aDataStream beginInstance:size:' and then (nextPut:/nextPutWeak:) its subobjects.  readDataFrom:size: reads back what we write here."
  | cntInstVars cntIndexedVars |
 
  cntInstVars := self class instSize.
  cntIndexedVars := self basicSize.
  aDataStream
+ beginInstance: self class
- beginInstance: self xxxClass
  size: cntInstVars + cntIndexedVars.
  1 to: cntInstVars do:
  [:i | aDataStream nextPut: (self instVarAt: i)].
 
  "Write fields of a variable length object.  When writing to a dummy
  stream, don't bother to write the bytes"
  ((aDataStream byteStream class == DummyStream) and: [self class isBits]) ifFalse: [
  1 to: cntIndexedVars do:
  [:i | aDataStream nextPut: (self basicAt: i)]].
  !

Item was removed:
- ----- Method: Object>>xxxClass (in category 'class membership') -----
- xxxClass
- "For subclasses of nil, such as ObjectOut"
- ^ self class!
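
For context, here is a minimal sketch of the kind of proxy this change aims to enable. The MyProxy class below is hypothetical (it is not part of this commit) and assumes #basicClass from this commit is loaded:

| proxyClass |
proxyClass := ProtoObject
	subclass: #MyProxy
	instanceVariableNames: 'target'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Example-Proxies'.
proxyClass compile: 'target: anObject
	target := anObject'.
proxyClass compile: 'class
	"Masquerade as the target''s class."
	^ target basicClass'.

As discussed in the replies below, such an override only takes effect for senders of #class that perform a real message send rather than the inlined special bytecode.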



Re: The Inbox: Kernel-cmm.1198.mcz

Levente Uzonyi
On Sat, 24 Nov 2018, [hidden email] wrote:

> Chris Muller uploaded a new version of Kernel to project The Inbox:
> http://source.squeak.org/inbox/Kernel-cmm.1198.mcz
>
> ==================== Summary ====================
>
> Name: Kernel-cmm.1198
> Author: cmm
> Time: 23 November 2018, 11:12:47.414703 pm
> UUID: fe228ca8-2ec7-4432-b3d9-76da98be4475
> Ancestors: Kernel-eem.1197
>
> - Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.

It won't work while the special bytecode for #class is compiled. And even
after that, you have to recompile all senders of #class to make it use
the primitive and the new method instead of optimizing it away.

> - If so, then #xxxClass can be banished.
> - With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".

That won't work either for the same reason. And we do not want to remove
the bytecode, do we?

Levente



Re: The Inbox: Kernel-cmm.1198.mcz

Chris Muller-3
> > Chris Muller uploaded a new version of Kernel to project The Inbox:
> > http://source.squeak.org/inbox/Kernel-cmm.1198.mcz
> >
> > ==================== Summary ====================
> >
> > Name: Kernel-cmm.1198
> > Author: cmm
> > Time: 23 November 2018, 11:12:47.414703 pm
> > UUID: fe228ca8-2ec7-4432-b3d9-76da98be4475
> > Ancestors: Kernel-eem.1197
> >
> > - Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
>
> It won't work while the special bytecode for #class is compiled. And even
> after that, you have to recompile all senders of #class to make it use
> the primitive and the new method instead of optimizing it away.

Right.  Assuming we can achieve consensus with Eliot, and the next
Squeak will have a new VM, then that would be called from an MC post
script.

But what do you mean by making all senders of #class use the primitive?
Just as you suggested the use of #ensureNonProxiedReceiver in the
other thread, the intention here is that #basicClass would better
document the performance-critical places, while leaving the majority
(the non-critical ones) sending #class, so that it remains overridable.

Do you think the system would be noticeably slower if all the sends to
#class became a message send?  I'm skeptical that it would, but I have
no idea.  I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most are rarely ever called.

Removing those bytecodes from my CompiledMethods is above my knowledge
level, but if you could help me come up with a script, I'd be
interested in testing and playing around to learn more.

> > - If so, then #xxxClass can be banished.
> > - With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".
>
> That won't work either for the same reason. And we do not want to remove
> the bytecode, do we?

Not remove it, redirect it to #basicClass.

This is a reasonable and familiar pattern, right?  It provides users
full control and WYSIWYG between source and bytecodes due to a
crystal-clear selector name.  No magic.

 - Chris


Re: The Inbox: Kernel-cmm.1198.mcz

Levente Uzonyi
On Sat, 24 Nov 2018, Chris Muller wrote:

>>> Chris Muller uploaded a new version of Kernel to project The Inbox:
>> > http://source.squeak.org/inbox/Kernel-cmm.1198.mcz
>> >
>> > ==================== Summary ====================
>> >
>> > Name: Kernel-cmm.1198
>> > Author: cmm
>> > Time: 23 November 2018, 11:12:47.414703 pm
>> > UUID: fe228ca8-2ec7-4432-b3d9-76da98be4475
>> > Ancestors: Kernel-eem.1197
>> >
>> > - Suggestion that #basicClass should be inlined while #class should be a message send, so that Proxy's can be supported.
>>
>> It won't work while the special bytecode for #class is compiled. And even
>> after that, you have to recompile all senders of #class to make it use
>> the primitive and the new method instead of optimizing it away.
>
> Right.  Assuming we can achieve consensus with Eliot, and the next
> Squeak will have a new VM, then that would be called from an MC post
> script.

I don't see what kind of VM changes are necessary here. Care to elaborate?

>
> But what do you mean make all senders of #class use the primitive?

Currently, when you compile a method containing a send of #class, the
compiler will generate a special bytecode for it (199).
When the interpreter/jit sees this bytecode, it will not perform a send
nor a primitive; it'll just look up the class of the receiver and place it
on top of the stack.
You can see this in action by removing the sole implementor of #class from
your image without any ill effects. That method is only there for
consistency; it is never executed.

So, while the bytecode is in use, it doesn't matter what you do with the
#class method, because it will never be sent.
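
One quick way to see this in your own image is to print the symbolic bytecodes of a method that sends #class, for example Object>>isKindOf: (the exact rendering depends on the bytecode set in use):

"Object>>isKindOf: sends #class; with the classic bytecode set the send
shows up as a special bytecode rather than an ordinary message send."
(Object >> #isKindOf:) symbolic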

> Just as you suggested the use of #ensureNonProxiedReceiver from the
> other thread, the intention here is that #basicClass would better
> document those performance-critical places, but leaving the majority
> (of non-critical ones) sending #class, so it can be overridable.

See above.

>
> Do you think the system would be noticably slower if all the sends to
> #class became a message send?  I'm skeptical that it would, but I have

Yes, the bytecode is way quicker than the primitive or a primitive + a
send, which is exactly what you suggested.
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break. This pretty much applies
to all special selectors (see SmalltalkImage >> #specialSelectors).

> no idea.  I am surprised to see we have so many senders of #class in
> trunk, but I have a feeling most rarely ever called.

I doubt that. People don't sprinkle #class sends for no reason, do they?

>
> Removing those byteCodes from my CompiledMethods is above my knowledge
> level, but if you could help me come up with a script, I'd be
> interested in testing and playing around to learn more.

VariableNode has a class variable named StdSelectors. It contains the
selectors for which custom bytecodes are generated. Removing #class from
there should be enough.
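
The concrete snippet, given later in this thread, boils down to:

(ParseNode classPool at: #StdSelectors) removeKey: #class.
Compiler recompileAll.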

>
>> > - If so, then #xxxClass can be banished.
>> > - With #xxxClass banished, the Squeak code that called it can be written normally, simply as "class".
>>
>> That won't work either for the same reason. And we do not want to remove
>> the bytecode, do we?
>
> Not remove it, redirect it to #basicClass.

Right, but while the bytecode is in effect, you just can't redirect
it.

Levente

>
> This is a reasonable and familiar pattern, right?  It provides users
> full control and WYSIWIG between source and bytecodes due to a crystal
> clear selector name.  No magic.
>
> - Chris


Re: The Inbox: Kernel-cmm.1198.mcz

Chris Muller-3
Hi Levente,

> > But what do you mean make all senders of #class use the primitive?
>
> Currently, when you compile a method containing a send of #class, the
> compiler will generate a special bytecode for it (199).
> When the interpreter/jit sees this bytecode, it will not perform a send
> nor a primitive; it'll just look up the class of the receiver and place it
> on top of the stack.

Great!  Does that mean this can be accomplished solely in the image by
making the compiler generate 199 when #basicClass is sent, and just
the normal "send" bytecode for sends to #class?

> > Do you think the system would be noticably slower if all the sends to
> > #class became a message send?  I'm skeptical that it would, but I have
>
> Yes, the bytecode is way quicker than the primitive or a primitive + a
> send which is exactly what you suggested.

It saves one send.  One.  That's only infinitesimally quicker:
_________
{ [ 1 xxxClass ] bench.
  [ 1 class ] bench. }

   ----> #('99,000,000 per second. 10.1 nanoseconds per run.' '126,000,000 per second. 7.93 nanoseconds per run.')
________

2 nanoseconds per send faster.  Inconsequential in any real-world
sense.  Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
wiped out.  Look how much of the gain is lost doing as little as
creating one single Rectangle from another one:

___________
"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
| someRectangle |   someRectangle := 100@50 corner: 320@200.
{  [someRectangle xxxClass origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.
   [someRectangle class      origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.   }

     --->  #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________

Real-world gain by the inlined send was reduced to...  whew!  I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.

So, amplify.  Crank it up to 100K:
__________
"Compare creating a 100,000 Rectangles with inlined #class vs. a
message-send of #class."
| someRectangle |   someRectangle := 100@50 corner: 320@200.
{  [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
   [ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.   }

     ---> #('364 per second. 2.75 milliseconds per run.' '369 per
second. 2.71 milliseconds per run.')
_________

Nothing times 100K is still nothing.

> Also, removing the bytecode will make #class lose its atomicity. Any code
> that relies on that behavior will silently break.

If THAT exists, it needs a more intention-revealing selector than
#class, one that would let its peers know atomicity matters there.
#basicClass is its friend.

> > ...  I am surprised to see we have so many senders of #class in
> > trunk, but I have a feeling most rarely ever called.
>
> I doubt that. People don't sprinkle #class sends for no reason, do they?

Sorry, I should not have said "ever".  I was trying to say the system
probably spends more of its time sending to instance-side methods than
to class-side ones.

> > Not remove it, redirect it to #basicClass.
>
> Right, but while the bytecode is in effect, you just can't redirect
> it.

I'm racking my brain trying to understand this -- sorry...   By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class.  Then, recompile all methods.  Would that work?
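
A rough, untested sketch of that redirect, assuming StdSelectors is a Dictionary mapping each special selector to its bytecode (as the ParseNode snippet later in the thread suggests):

| specials |
specials := ParseNode classPool at: #StdSelectors.
"Assumes removeKey: answers the bytecode previously registered for #class."
specials at: #basicClass put: (specials removeKey: #class).
Compiler recompileAll.

Levente's caveats below about atomicity and the number of existing senders would still apply.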

> > This is a reasonable and familiar pattern, right?  It provides users
> > full control and WYSIWIG between source and bytecodes due to a crystal
> > clear selector name.  No magic.

So, if
   performance is not really hurt, and
   we can keep sending #class if so insisted, and
   we still have #basicClass, just in case, together
   delineating an elegant seam between system-level vs. user-level access
   in a classic Smalltalky way that even *I* can understand and use,
   and give Squeak better Proxy support that helps Magma
then
   would you let me have this?

You have a skill for weighing performance considerations to degrees I
never would have fathomed, and that has resulted in immense performance
benefits for Squeak.  I do wish you liked Magma, because I'm sure you
could _obliterate_ many inefficiencies in its code and design.  But if
not, I hope you can at least appreciate that the value proposition of
this proposal is worth it.

 - Chris


Re: The Inbox: Kernel-cmm.1198.mcz

Levente Uzonyi
Hi Chris,

On Sun, 25 Nov 2018, Chris Muller wrote:

> Hi Levente,
>
>>> But what do you mean make all senders of #class use the primitive?
>>
>> Currently, when you compile a method containing a send of #class, the
>> compiler will generate a special bytecode for it (199).
>> When the interpreter/jit sees this bytecode, it will not perform a send
>> nor a primitive; it'll just look up the class of the receiver and place it
>> on top of the stack.
>
> Great!  Does that mean this can be accomplished solely in the image by
> making the compiler generate 199 when #basicClass is sent, and just
> the normal "send" bytecode for sends to #class?
>
>>> Do you think the system would be noticably slower if all the sends to
>>> #class became a message send?  I'm skeptical that it would, but I have
>>
>> Yes, the bytecode is way quicker than the primitive or a primitive + a
>> send which is exactly what you suggested.
>
> It saves one send.  One.  That's only infinitesimally quicker:
> _________
> { [1 xxxClass] bench.
> [ 1 class ] bench.  }
>
>   ----> #('99,000,000 per second. 10.1 nanoseconds per run.'
> '126,000,000 per second. 7.93 nanoseconds per run.')
> ________
>
> 2 nanoseconds per send faster.  Inconsequential in any real-world
> sense.  Furthermore, as soon as the message sent to the class does
> *any work* whatsoever, that good-sounding 27% improvement is quickly
> wiped out.  Look how much of the gain is lost doing as little as
> creating one single Rectangle from another one:
>
> ___________
> "Compare creating a single Rectangle with inlined #class vs. a
> (proposed) message-send of #class."
> | someRectangle |   someRectangle := 100@50 corner: 320@200.
> {  [someRectangle xxxClass origin: someRectangle topLeft corner:
> someRectangle bottomRight ] bench.
>   [someRectangle class      origin: someRectangle topLeft corner:
> someRectangle bottomRight ] bench.   }
>
>     --->  #('37,200,000 per second. 26.9 nanoseconds per run.'
> '38,000,000 per second. 26.3 nanoseconds per run.')
> ____________
>
> Real-world gain by the inlined send was reduced to...  whew!  I just
> had to go learn about "Picosecond" because nanoseconds aren't even
> small enough to measure the improvement.
>
> So, amplify.  Crank it up to 100K:
> __________
> "Compare creating a 100,000 Rectangles with inlined #class vs. a
> message-send of #class."
> | someRectangle |   someRectangle := 100@50 corner: 320@200.
> {  [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
> topLeft corner: someRectangle bottomRight] ] bench.
>   [ 100000 timesRepeat: [someRectangle class origin: someRectangle
> topLeft corner: someRectangle bottomRight] ] bench.   }
>
>     ---> #('364 per second. 2.75 milliseconds per run.' '369 per
> second. 2.71 milliseconds per run.')
> _________
>
> Nothing times 100K is still nothing.


That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block. Also, #timesRepeat: is not a good choice for
measurements for the very same reason: block creation + lots of block
activation.
Also, the nearby bytecodes affect what the JIT does. When more things can
be executed without performing a send, the overall performance gains
will be higher.
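
A minimal sketch of a measurement that avoids per-iteration block creation (the same approach as the benchmark at the end of this thread): put the sends inside an inlined to:do: loop and subtract an empty-loop baseline.

| withSends baseline |
"to:do: with literal blocks is inlined by the compiler, so the loop itself
adds no block activations."
withSends := [ 1 to: 10000000 do: [ :i | i class. i class. i class. i class ] ] timeToRun.
baseline := [ 1 to: 10000000 do: [ :i | i ] ] timeToRun.
withSends - baseline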

>
>> Also, removing the bytecode will make #class lose its atomicity. Any code
>> that relies on that behavior will silently break.
>
> If THAT exists it needs a more intention-revealing selector than
> #class that would let his peers know atomicity mattered there.
> #basicClass is his friend.

All special selectors do the same, e.g. #==, #ifNil:, #ifTrue:. Do you
think all of those need #basicXXX methods?
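
To see the full list in your image, printing the collection referenced above should work:

"The selectors for which the compiler emits special send bytecodes
(see SmalltalkImage >> #specialSelectors, mentioned above)."
SmalltalkImage current specialSelectors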

>
>>> ...  I am surprised to see we have so many senders of #class in
>>> trunk, but I have a feeling most rarely ever called.
>>
>> I doubt that. People don't sprinkle #class sends for no reason, do they?
>
> Sorry, I should not have said "ever".  I was trying to say the system
> probably spends most of its time sending to instance-side methods than
> class-side methods.

It's a common pattern to have instance-independent code on the class side.
Quick access to that is always a good thing.

>
>>> Not remove it, redirect it to #basicClass.
>>
>> Right, but while the bytecode is in effect, you just can't redirect
>> it.
>
> I'm racking my brain trying to understand this -- sorry...   By
> "redirect" I just meant change the Compiler to generate bytecode 199
> for sends to #basicClass, and just the regular "send" bytecode for
> sends to #class.  Then, recompile all methods.  Would that work?

It might work, but you would need to identify and rewrite senders of
#class which rely on the presence of the bytecode. In my image there are
2174 senders, which is simply too much review in my opinion.

I did some measurements and found that the JIT makes the numbered
primitive almost as quick as the bytecode. The slowdown is only about 10%.
Your suggestion, which is send + bytecode is about 85% slower and loses
the atomicity of the message. So, you'd better leave the implementation of
#class as it is right now, because that would be quicker and would
preserve the atomicity as long as nothing overrides it.

>
>>> This is a reasonable and familiar pattern, right?  It provides users
>>> full control and WYSIWIG between source and bytecodes due to a crystal
>>> clear selector name.  No magic.
>
> So, if
>   performance is not really hurt, and
>   we can keep sending #class if so insisted, and
>   we still have #basicClass, just in case, together
>   delineating an elegant seam between system-level vs. user-level access
>   in a classic Smalltalky way that even *I* can understand and use,
>   and give Squeak better Proxy support that helps Magma
> then
>   would you let me have this?

As I wrote a few emails earlier, I'd rather have a "switch" for this
than force it on everyone who doesn't use proxies at all (I presume that's
the current majority of Squeak users).

Levente

>
> You have a skill of making performance-considerations to such degrees
> that I never even would have fathomed, and this has resulted in
> immense performance benefits for Squeak.  I do wish you liked Magma,
> because I'm sure you could _obliterate_ many inefficiencies in the
> code and design.  But if not, I hope you can at least appreciate the
> value proposition of this proposal is worth it.
>
> - Chris
>


Re: The Inbox: Kernel-cmm.1198.mcz

Chris Muller-4
Hi Levente,

Just a reminder, the original question I asked was:

> >>> Do you think the system would be noticably slower if all the sends to
> >>> #class became a message send?  ...

and your response:

> >> Yes, the bytecode is way quicker than the primitive or a primitive + a
> >> send which is exactly what you suggested.

So even though you answered a different question, I was still curious
about your claim, and remembered that you're one who likes to communicate
with benchmarks.  That's why I ran the benchmarks and presented them to
you, but I'm not sure whether we're interpreting the results relative to
my question or to some other question...

> > It saves one send.  One.  That's only infinitesimally quicker:
> > _________
> > { [1 xxxClass] bench.
> > [ 1 class ] bench.  }
> >
> >   ----> #('99,000,000 per second. 10.1 nanoseconds per run.'
> > '126,000,000 per second. 7.93 nanoseconds per run.')
> > ________
> >
> > 2 nanoseconds per send faster.  Inconsequential in any real-world
> > sense.  Furthermore, as soon as the message sent to the class does
> > *any work* whatsoever, that good-sounding 27% improvement is quickly
> > wiped out.  Look how much of the gain is lost doing as little as
> > creating one single Rectangle from another one:
> >
> > ___________
> > "Compare creating a single Rectangle with inlined #class vs. a
> > (proposed) message-send of #class."
> > | someRectangle |   someRectangle := 100@50 corner: 320@200.
> > {  [someRectangle xxxClass origin: someRectangle topLeft corner:
> > someRectangle bottomRight ] bench.
> >   [someRectangle class      origin: someRectangle topLeft corner:
> > someRectangle bottomRight ] bench.   }
> >
> >     --->  #('37,200,000 per second. 26.9 nanoseconds per run.'
> > '38,000,000 per second. 26.3 nanoseconds per run.')
> > ____________
> >
> > Real-world gain by the inlined send was reduced to...  whew!  I just
> > had to go learn about "Picosecond" because nanoseconds aren't even
> > small enough to measure the improvement.
> >
> > So, amplify.  Crank it up to 100K:
> > __________
> > "Compare creating a 100,000 Rectangles with inlined #class vs. a
> > message-send of #class."
> > | someRectangle |   someRectangle := 100@50 corner: 320@200.
> > {  [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
> > topLeft corner: someRectangle bottomRight] ] bench.
> >   [ 100000 timesRepeat: [someRectangle class origin: someRectangle
> > topLeft corner: someRectangle bottomRight] ] bench.   }
> >
> >     ---> #('364 per second. 2.75 milliseconds per run.' '369 per
> > second. 2.71 milliseconds per run.')
> > _________
> >
> > Nothing times 100K is still nothing.
>
>
> That's not the right way to measure things that are so quick, because the
> overhead of block activation is comparable to the runtime of the code
> inside the block. Also, #timesRepeat: is not a good choice for
> measurements for the very same reason: block creation + lots of block
> activation.
> Also, the nearby bytecodes affect what the JIT does. When more things can
> be executed without performing a send, the overall performance gains
> will be higher.

There are three benchmarks; did you notice the first two?

    - The first one measures the single-unit cost of #xxxClass over
#class.  This captures your theoretical maximum benefit of 27%, which
is terrible, because it can't come close to that in real code.

    - The second demonstrates how 90% of that 27% benefit is wiped out
with no more than a single simple allocation -- what the vast majority
of class methods are responsible for.

    - The third one measures "real world impact", and shows that this
particular inlining doesn't help the system in any way that helps any
human anywhere.

> >> Also, removing the bytecode will make #class lose its atomicity. Any code
> >> that relies on that behavior will silently break.
> >
> > If THAT exists it needs a more intention-revealing selector than
> > #class that would let his peers know atomicity mattered there.
> > #basicClass is his friend.
>
> All special selectors do the same e.g. #==, #ifNil:, #ifTrue:. Do you
> think all of those need #basicXXX methods?

No, just #class.  An identity check should be an identity check, even
against a Proxy.  And does that example help illustrate how using #==
when you DON'T need an identity check breaks encapsulation?
It makes false assumptions and enforces type conformance in a system
that wants to be empowered by messaging.

> >>> ...  I am surprised to see we have so many senders of #class in
> >>> trunk, but I have a feeling most rarely ever called.
> >>
> >> I doubt that. People don't sprinkle #class sends for no reason, do they?
> >
> > Sorry, I should not have said "ever".  I was trying to say the system
> > probably spends most of its time sending to instance-side methods than
> > class-side methods.
>
> It's a common pattern to have instance-independent code on the class side.
> Quick access to that is always a good thing.

It's still quick!  Levente, I challenge you to back up your claim by
identifying any single method in the image which reports even a
meaningfully better *bench* performance (much less real-world) when
called via #class instead of #xxxClass.

Anything whose performance matters at the level of one send is going to
use #basicClass anyway, just like the few places where we send
#basicNew instead of #new.

> >>> Not remove it, redirect it to #basicClass.
> >>
> >> Right, but while the bytecode is in effect, you just can't redirect
> >> it.
> >
> > I'm racking my brain trying to understand this -- sorry...   By
> > "redirect" I just meant change the Compiler to generate bytecode 199
> > for sends to #basicClass, and just the regular "send" bytecode for
> > sends to #class.  Then, recompile all methods.  Would that work?
>
> It might work, but you would need to identify and rewrite senders of
> #class which rely on the presence of the bytecode. In my image there are
> 2174 senders, which is simply too much review in my opinion.

I repeat my challenge above!

> I did some measurements and found that the JIT makes the numbered
> primitive almost as quick as the bytecode. The slowdown is only about 10%.
> Your suggestion, which is send + bytecode is about 85% slower and loses
> the atomicity of the message. So, you'd better leave the implementation of
> #class as it is right now, because that would be quicker and would
> preserve the atomicity as long as nothing overrides it.

Huh?  No, you're only 27% faster in the *benchmark*, but near zero in
anything real-world.

My challenge above stands.  I would love to be wrong, so I could shed
my suspicion of whether this is about something else not mentioned...
 :(

> >>> This is a reasonable and familiar pattern, right?  It provides users
> >>> full control and WYSIWIG between source and bytecodes due to a crystal
> >>> clear selector name.  No magic.
> >
> > So, if
> >   performance is not really hurt, and
> >   we can keep sending #class if so insisted, and
> >   we still have #basicClass, just in case, together
> >   delineating an elegant seam between system-level vs. user-level access
> >   in a classic Smalltalky way that even *I* can understand and use,
> >   and give Squeak better Proxy support that helps Magma
> > then
> >   would you let me have this?
>
> As I wrote it a few emails earlier, I'd rather have a "switch" for this
> than forcing it on everyone who don't use proxies at all (I presume that's
> the current majority of Squeak users).

Whoa, hold on there.  You only ever made one argument -- "performance"
-- which was obliterated by the benchmarks.  Squeezing 27% more out of
a microbench of something called 0.0001% of the time results in no
benefit to anyone anywhere.

I see MY position as the pro-user position, and yours as the... pro
fastest-lab-result position, which hurts this Squeak user.  I'm sad that
that alone isn't enough to support this.    :(
_______
Do you remember when Behavior>>#new didn't always make a call to
#initialize?  Even at a time when Squeak was 10X slower than it is now,
the people then had the wisdom to understand that the computer and
software exist to serve _users_, and that not spiting users just to save
one single send, even when that send was a much greater percentage of
impact back then, was still well worth it.


Re: The Inbox: Kernel-cmm.1198.mcz

Chris Muller-4
In reply to this post by Levente Uzonyi
> That's not the right way to measure things that are so quick, because the
> overhead of block activation is comparable to the runtime of the code
> inside the block.

I get you, but the fact that it's so hard to even write such a test
indicates that real-world code also needs to do a lot of block
activations, and so this quickly dilutes the density of calls to #class.

The only way I could think of was to just cut-and-paste the block innards
100 times and measure the degradation from the baseline (single):

{ [1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1
xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass. 1 xxxClass.
1 xxxClass. ] bench.

[ 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class.  1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class.  1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class.  1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class.  1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class.  1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class.  1 class. 1 class. 1
class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class.
1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1 class. 1
class. 1 class. 1 class. ] bench.  }

 #('2,780,000 per second. 360 nanoseconds per run.' '5,590,000 per second. 179 nanoseconds per run.')

So 100X the density of calls to #xxxClass degraded the performance
from 27% slower to 50% slower.

So the real question is how dense the calls to #class are, and whether
they come mostly from only a few senders which could retain the
optimization via #basicClass.  It would be an interesting experiment.
Pointless, though, if there's no chance of swaying you.


Re: The Inbox: Kernel-cmm.1198.mcz

Levente Uzonyi
In reply to this post by Chris Muller-4
Hi Chris,

This conversation is getting off track, so let's take a step back and
try something different.
I had suggested a solution to you: the "switch", but you never mentioned
how it worked for you. Perhaps my explanation wasn't clear.
Let me just give you a snippet which does exactly what I suggested.
Please try it in your image (one without Kernel-cmm.1198 loaded) and let
me know whether it solves your problem:

  (ParseNode classPool at: #StdSelectors) removeKey: #class.
  Compiler recompileAll.

Levente

P.S.: Here's the benchmark I used to get my numbers:

runs := (1 to: 5) collect: [ :e |
	{
		[ 1 to: 50000000 do: [ :i | i class class class class class class class class class class ] ] timeToRun.
		[ 1 to: 50000000 do: [ :i | i classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive classPrimitive ] ] timeToRun.
		[ 1 to: 50000000 do: [ :i | i classSend classSend classSend classSend classSend classSend classSend classSend classSend classSend ] ] timeToRun.
		[ 1 to: 50000000 do: [ :i | i ] ] timeToRun } ].
cleanRuns := runs collect: [ :e | (e - e last) allButLast ].
primitiveVsByteCode := (cleanRuns collect: [ :e | e second / e first ]) average printShowingMaxDecimalPlaces: 2.
sendVsByteCode := (cleanRuns collect: [ :e | e third / e first ]) average printShowingMaxDecimalPlaces: 2.

Where Object >> #classPrimitive is

classPrimitive
	"Primitive. Answer the object which is the receiver's class. Essential.
	See Object documentation whatIsAPrimitive."

	<primitive: 111>
	self primitiveFailed

And Object >> #classSend is

classSend
	"Answer the object which is the receiver's class, via an ordinary message send."

	^self class
