primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)


Levente Uzonyi
 
Hi Dave,

I dug a bit deeper into this problem, and I found that it's been around
for ages. I compared the two primitives in three older images.
The first one is Squeak 4.2 running on Cog r2714:

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
'6,880,000 per second.'

| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
'11,200,000 per second.'

The next one was an even older Cog VM running an image updated from
3.10.2. The VM didn't have the primitives required to fetch the VM
information. The results were '8.22459348130374e6 per second.' and
'1.19677724910036e7 per second.', respectively.

The last image was a closure-enabled 3.10.2 running on the classic
Interpreter VM. The results were '3.911442911417717e6 per second.' and
'4.84567866426715e6 per second.', respectively.

Since in all VMs the seemingly more complex code (primitiveCompareString
with a subtraction) was quicker than the simpler code
(primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled
with less aggressive optimization than MiscPrimitivePlugin.

What's also interesting is that, based on these benchmarks, the
performance got worse over the years. I think invoking a primitive has its
cost in newer VMs.

Levente

On Thu, 14 Apr 2016, David T. Lewis wrote:

> Oops, I just realized that my "interpreter VM" results were from a debugging
> VM with compiler optimization off (too many VMs, sorry). Here are the results
> for a more representative interpreter VM. I can't explain the variation, but
> one conclusion is clear: my suggested "optimization" in the inbox is a bad idea
> (and I will move it to the treated inbox).
>
>  "Base performance in the trunk level V3 image with interpreter VM"
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"
>
>  "Performance after adding LargePositiveInteger>>="
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]].  "==> 27824"
>  Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
>
> Dave
>
>
>
> On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
>> Hi Levente,
>>
>> I think you may be right. I repeated my test with an interpreter VM on a
>> 32-bit image (with loop count smaller because the interpreter VM is slower
>> than Spur). The change that I put in the inbox does not have any benefit
>> for the interpreter VM:
>>
>>
>>   a := 7432154326465436. "a big number"
>>   b := a + 1. "low order digit changed"
>>   c := 8432154326465436. "high order digit changed"
>>   d := 7432154026465436. "a digit in the middle changed"
>>
>>    "Base performance in the trunk level V3 image with interpreter VM"
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"
>>
>>   "Performance after adding LargePositiveInteger>>="
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
>>
>> So yes, it is something related to the VM. But I do not understand
>> how #primDigitCompare: could be so slow. As you say, maybe something
>> is wrong with it.
>>
>> Thank you, I am glad that I asked for a review :-)
>>
>> Dave
>>
>>
>> On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
>>> Hi Dave,
>>>
>>> I guess this is a VM related issue. On 64-bit Spur
>>> [ 7432154326465436 digitCompare: 8432154326465436 ] bench.
>>> returns '5,420,000 per second. 184 nanoseconds per run.'.
>>>
>>> With the primitive call removed from #digitCompare:, the number goes up:
>>> '20,200,000 per second. 49.4 nanoseconds per run.'.
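[Editorial note: for reference, with the primitive call removed, #digitCompare: runs image-side code roughly like the following sketch. This is an approximation of the fallback, not the exact trunk method; details may differ.]

```smalltalk
"Sketch of a pure-Smalltalk magnitude comparison of two large integers,
approximating what #digitCompare: computes when its primitive is absent.
Answers 1 if the receiver's magnitude is larger, -1 if smaller, 0 if equal."
digitCompare: arg
	| len argLen |
	len := self digitLength.
	(argLen := arg digitLength) ~= len
		ifTrue: [^argLen > len ifTrue: [-1] ifFalse: [1]].
	len to: 1 by: -1 do: [:i | | a b |
		(a := self digitAt: i) ~= (b := arg digitAt: i)
			ifTrue: [^a < b ifTrue: [-1] ifFalse: [1]]].
	^0
```

Since digits are stored least-significant first, the loop walks from the most significant digit down, so it can answer as soon as a pair of digits differs.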
>>>
>>> You might think that it's okay because the JIT is so good. But we have
>>> another primitive to compare two bytes-objects, one that ought to be
>>> slower than #primDigitCompare:, because it maps the bytes before doing the
>>> comparison. I even subtracted two from its result to match the result of
>>> #digitCompare:
>>>
>>> | order |
>>> order := (0 to: 255) as: ByteArray.
>>> [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
>>> order) - 2 ] bench.
>>>
>>> But it's still about twice as quick as #primDigitCompare:.
>>> '9,590,000 per second. 104 nanoseconds per run.'.
>>> So, something must be wrong with #primDigitCompare:.
>>>
>>> Levente
>>>
>>> On Wed, 13 Apr 2016, David T. Lewis wrote:
>>>
>>>> I would appreciate a review before moving this to trunk.
>>>>
>>>> Background:
>>>>
>>>> In UTCDateAndTime, most things are much faster than the trunk version.
>>>> However, the DateAndTime equality check did not improve for 32-bit images,
>>>> so I ran it under AndreasSystemProfiler. Profiling showed that large
>>>> integer equality checks spend their time mostly in primDigitCompare, which
>>>> is inefficient when only a simple byte comparison is needed.
>>>>
>>>> Here is the performance difference that I see on my system, 32-bit trunk
>>>> Spur on Linux:
>>>>
>>>> | a b c d |
>>>> a := 7432154326465436. "a big number"
>>>> b := a + 1. "low order digit changed"
>>>> c := 8432154326465436. "high order digit changed"
>>>> d := 7432154026465436. "a digit in the middle changed"
>>>>
>>>> "Base performance in the trunk image"
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
>>>>
>>>> "Performance after adding LargePositiveInteger>>="
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
>>>>
>>>> Dave
>>>>
>>>>
>>>>
>>>> On Wed, Apr 13, 2016 at 10:57:28PM +0000, [hidden email] wrote:
>>>>> David T. Lewis uploaded a new version of Kernel to project The Inbox:
>>>>> http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
>>>>>
>>>>> ==================== Summary ====================
>>>>>
>>>>> Name: Kernel-dtl.1015
>>>>> Author: dtl
>>>>> Time: 13 April 2016, 6:57:22.56608 pm
>>>>> UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
>>>>> Ancestors: Kernel-mt.1014
>>>>>
>>>>> Make large integer equality test be about 13 times faster. Implement #=
>>>>> in LargePositiveInteger, and use digitAt: (primitive 60) for the
>>>>> comparison.
>>>>>
>>>>> =============== Diff against Kernel-mt.1014 ===============
>>>>>
>>>>> Item was added:
>>>>> + ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
>>>>> + = aNumber
>>>>> +
>>>>> +     aNumber class == self class ifTrue: [
>>>>> +         aNumber size = self size ifFalse: [ ^false ].
>>>>> +         self size to: 1 by: -1 do: [ :i |
>>>>> +             (aNumber digitAt: i) = (self digitAt: i) ifFalse: [ ^ false ] ].
>>>>> +         ^ true ].
>>>>> +     aNumber isInteger ifTrue: [ ^false ].
>>>>> +     aNumber isNumber ifFalse: [ ^false ].
>>>>> +     ^aNumber adaptToInteger: self andCompare: #=!
>>>>>
>>>>
>>>>
>
>

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

David T. Lewis
 
I don't know if the apparent slowdown in newer VMs is significant.
It might be that some inefficiencies have crept into the primitive
calling, but it may just be differences in the compilers or compiler
optimization settings that were used when they were compiled.

Overall, I would expect that primitive calling will be faster in
Cog, and primitive execution time should be reasonably similar in
any version of the VM. The differences between Cog and classic
interpreter suggest that the cost of calling a primitive (especially
a small primitive) can be relatively high.

I have a feeling that we are reaching a point at which the jit
optimization may exceed the performance of compiled primitives. It
seems quite reasonable to me that a jitted method might outperform
the compiled primitive version of the same method, because the jit
version can completely avoid the overhead of setting up the call to the
compiled primitive.

If that is so, then we may have primitives that actually slow the
system down when running on Cog.

What if we find that some primitives are useful if and only if we are
running without the jit? In that case, we might want to call the
primitives when running on Cog, and not call them when running on
a plain StackInterpreter.

This is probably a dumb idea, but I'll toss it out anyway. I wonder
if there might be a way to cause certain primitives to be called
only if they are needed for a non-jit VM? For example, in Squeak
we have Environments, and we have pragmas. So I wonder if there
might be some way to do something like this:

  <primitive: 'primitiveFindSubstring' module: 'MiscPrimitivePlugin' callUnless: #cog>
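[Editorial note: short of such VM or pragma support, a cruder image-side variant of the same idea can be sketched by testing the VM and dispatching to a primitive-free copy of the method. #isRunningCog is a real Squeak selector; #digitCompareNoPrim: is a hypothetical helper assumed to hold the fallback code without the primitive declaration.]

```smalltalk
"Hypothetical image-side guard: use the jitted Smalltalk fallback on Cog,
and the plugin primitive on a plain interpreter VM.
#digitCompareNoPrim: is an assumed helper, not an existing trunk method."
digitCompare: arg
	Smalltalk isRunningCog
		ifTrue: [^self digitCompareNoPrim: arg].
	^self primDigitCompare: arg
```

The cost of this approach is an extra send and a maintenance burden of two copies of the method body, which is why a VM-side "do-not-call" list may be preferable.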

Dave


On Thu, Apr 14, 2016 at 07:04:47PM +0200, Levente Uzonyi wrote:

>
> Hi Dave,
>
> [snip]
>

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Eliot Miranda-2

Hi Dave,

> On Apr 14, 2016, at 5:09 PM, David T. Lewis <[hidden email]> wrote:
>
> [snip]
>
> This is probably a dumb idea, but I'll toss it out anyway. I wonder
> if there might be a way to cause certain primitives to be called
> only if they are needed for a non-jit VM? For example, in Squeak
> we have Environments, and we have pragmas. So I wonder if there
> might be some way to do something like this:
>
>  <primitive: 'primitiveFindSubstring' module: 'MiscPrimitivePlugin' callUnless: #cog>

Actually I think it's a good idea, but we don't need an additional keyword and state in primitive methods.  The JIT VM could maintain a "do not call" list, or simply not include certain primitives.  What we need are accurate measurements with and without; I can then update the "do-not-call" list.

One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, so the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point overtakes the JIT.  Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
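[Editorial note: the crossover described here could be located empirically by benching both code paths over operands of increasing digit length, along the following lines. A sketch only; the chosen sizes and the Transcript reporting are arbitrary.]

```smalltalk
"Hypothetical crossover measurement: the digit-compare primitive versus
the string-compare primitive, for large integers of growing byte size.
Operands differ only in the least significant digit, forcing a full scan."
| order |
order := (0 to: 255) as: ByteArray.
#(8 32 128 512 2048) do: [:nBytes |
	| a b |
	a := (1 bitShift: 8 * nBytes) - 1.	"an nBytes-byte large integer"
	b := a - 1.
	Transcript cr;
		show: nBytes printString , ' bytes: ';
		show: [a digitCompare: b] bench;
		show: ' vs ';
		show: [(ByteString compare: a with: b collated: order) - 2] bench]
```

If the JIT wins only below some operand size, that size would be the natural threshold for any "do-not-call" decision.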

> Dave

_,,,^..^,,,_ (phone)


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

David T. Lewis
 
On Thu, Apr 14, 2016 at 08:01:41PM -0700, Eliot Miranda wrote:

>
> Hi Dave,
>
> > [snip]
>
> Actually I think it's a good idea, but we don't need an additional keyword and state in primitive methods.  The JIT VM could maintain a "do not call" list, or simply not include certain primitives.  What we need are accurate measurements with and without; I can then update the "do-not-call" list.
>
> One issue is that for simple invocations (e.g. invocations with short strings as arguments) the JIT code may indeed be faster, so the overhead of the call makes the no-primitive code win; but longer arguments will be favored by the simpler indexing code in the C primitive, which at a certain point overtakes the JIT.  Once Sista arrives this shouldn't be an issue, and all of MiscPrimitivePlugin should hopefully be redundant.
>

<meta>
It is really quite remarkable that we could be having this discussion
at all. An open source virtual machine for Smalltalk that is so fast that
it is not even worth the trouble to call out to C primitives? Really?!? Wow.
</meta>

Dave

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Ben Coman

On Fri, Apr 15, 2016 at 11:27 AM, David T. Lewis <[hidden email]> wrote:

>
>
> <meta>
> It is really quite remarkable that we could be having this discussion
> at all. An open source virtual machine for Smalltalk that is so fast that
> is not even worth the trouble to call out to C primitives? Really?!? Wow.
> </meta>
>
> Dave

If it turns out to be true, it would certainly be a good meme for the
marketing department to work with.

cheers -ben

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Bert Freudenberg
In reply to this post by David T. Lewis
 
On 15.04.2016, at 05:27, David T. Lewis <[hidden email]> wrote:
>
> <meta>
> It is really quite remarkable that we could be having this discussion
> at all. An open source virtual machine for Smalltalk that is so fast that
> is not even worth the trouble to call out to C primitives? Really?!? Wow.
> </meta>

You would get a kick out of taking a closer look at the RSqueak VM. It doesn’t even use the primitives for BitBlt but runs the Smalltalk code.

- Bert -





Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

David T. Lewis
 
On Fri, Apr 15, 2016 at 11:51:21AM +0200, Bert Freudenberg wrote:

>  
> On 15.04.2016, at 05:27, David T. Lewis <[hidden email]> wrote:
> >
> > <meta>
> > It is really quite remarkable that we could be having this discussion
> > at all. An open source virtual machine for Smalltalk that is so fast that
> > is not even worth the trouble to call out to C primitives? Really?!? Wow.
> > </meta>
>
> You would get a kick out of taking a closer look at the RSqueak VM. It doesn't even use the primitives for BitBlt but runs the Smalltalk code.
>

You're right, I would like to take a closer look at RSqueak. It's getting
hard to keep up with all of this progress :-)

Dave


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

philippeback
In reply to this post by Bert Freudenberg
 
What's RSqueak VM?

Where to find it?

Phil



Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

timfelgentreff
Hi,

RSqueakVM is a research Squeak VM written in RPython, with a tracing JIT compiler supporting x86 32- and 64-bit, armv6, armv7, and PowerPC (although we have deactivated the non-x86-32 builds in the CI right now, because it takes forever to run through qemu-chroot and nobody has time to work on those platforms at the moment).

RSqueak supports all image formats from Squeak 2 all the way up to current trunk. So you can have a jitted mini image, if you find a filein with the BitBlt code in Smalltalk, because we don't have the BitBlt plugin ;)

The JIT is fast enough to omit almost all non-essential primitives, and to simulate many primitives (like Balloon and BitBlt) by running the Slang code instead of a plugin. In fact, we don't have the large integer, misc, BitBlt, or Balloon plugins, relying instead on the fallback and VMMaker code to make execution fast. In many cases this works quite well, although warmup is still a big issue for us (the framerate of the image keeps climbing steadily for the first few *minutes* of use - there is just a lot of stuff that needs to be jitted before BitBlt, Balloon, and font rendering get fast).

We are currently preparing an alpha release as a one-click bundle for people to try it out.

Development currently happens at https://github.com/HPI-SWA-Lab/RSqueak. Binaries are built automatically on each commit, but you also need SDL2 and a current image with VMMaker (we need a number of fixes in VMMaker and the fallback code - so older images need these fixes filed in somehow, too - we're also preparing changesets for those).

philippeback wrote
What's RSqueak VM?

Where to find it?

Phil


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
In reply to this post by Levente Uzonyi
 
This thread got derailed, and I feel like we should go back to its
main point, which was that #digitCompare: using #primitiveDigitCompare is
slow.
It's so slow that #primitiveCompareString is about twice as quick in
Spur VMs, which is odd.

Levente
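For illustration only, the per-byte work of the two primitives being benchmarked can be sketched in C. This is a hedged sketch: the function and variable names are illustrative, not the plugins' actual code, and it assumes byte-indexable operands plus the identity collation table from the benchmark above.

```c
#include <stddef.h>

/* compare_plain stands in for the digit loop behind #digitCompare:;
 * compare_collated stands in for #primitiveCompareString, which maps
 * every byte through a collation table before comparing. With the
 * identity table `order := (0 to: 255) as: ByteArray`, both must agree,
 * yet the collated version does one extra table load per byte. */
static int compare_collated(const unsigned char *a, const unsigned char *b,
                            size_t len, const unsigned char *order)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char ca = order[a[i]], cb = order[b[i]];
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return 0;
}

static int compare_plain(const unsigned char *a, const unsigned char *b,
                         size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}
```

Seen this way, the collated loop strictly contains the plain loop's work, which is why its winning the benchmark points at overhead outside the loop rather than inside it.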

On Thu, 14 Apr 2016, Levente Uzonyi wrote:

> Hi Dave,
>
> I dag a bit deeper into this problem, and I found that it's been around for
> ages. I compared the two primitives in three older images.
> The first one is Squeak 4.2 running on Cog r2714:
>
> [ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> '6,880,000 per second.'
>
> | order |
> order := (0 to: 255) as: ByteArray.
> [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> order) - 2 ] bench.
> '11,200,000 per second.'
>
> The next one was an even older Cog VM running an image updated from 3.10.2.
> The VM didn't have the primitives required to fetch the VM information. The
> result were '8.22459348130374e6 per second.' and
> '1.19677724910036e7 per second.', respectively.
>
> The last image was a closure-enabled 3.10.2 running on the classic
> Interpreter VM. The results were '3.911442911417717e6 per second.' and
> '4.84567866426715e6 per second.', respectively.
>
> Since in all VMs the seemingly more complex code (primitiveCompareString with
> a subtraction) was quicker than the simpler code (primitiveDigitCompare), I
> suspect that LargeIntegersPlugin is compiled
> with less aggressive optimization than MiscPrimitivePlugin.
>
> What's also interesting is that, based on these benchmarks, the performance
> got worse over the years. I think invoking a primitive has its cost in newer
> VMs.
>
> Levente
>
> On Thu, 14 Apr 2016, David T. Lewis wrote:
>
>> Oops, I just realized that my "interpreter VM" results were from a
>> debugging
>> VM with compiler optimization off (too many VMs, sorry). Here are the
>> results
>> for a more representative interpreter VM. I can't explain the variation,
>> but
>> one conclusion is clear: my suggested "optimization" in the inbox is a bad
>> idea
>> (and I will move it to the treated inbox).
>>
>>  "Base performance in the trunk level V3 image with interpreter VM"
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"
>>
>>  "Performance after adding LargePositiveInteger>>="
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]].  "==> 27824"
>>  Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
>>
>> Dave
>>
>>
>>
>> On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
>>> Hi Levente,
>>>
>>> I think you may be right. I repeated my test with an interpreter VM on a
>>> 32-bit image (with loop count smaller because the interpreter VM is slower
>>> than Spur). The change that I put in the inbox does not have any benefit
>>> for the interpreter VM:
>>>
>>>
>>>   a := 7432154326465436. "a big number"
>>>   b := a + 1. "low order digit changed"
>>>   c := 8432154326465436. "high order digit changed"
>>>   d := 7432154026465436. "a digit in the middle changed"
>>>
>>>    "Base performance in the trunk level V3 image with interpreter VM"
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]].. "==> 3800"
>>>
>>>   "Performance after adding LargePositiveInteger>>="
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
>>>   Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
>>>
>>> So yes it is something related to the VM. But I do not understand
>>> how #primDigitCompare could be so slow? As you say, maybe something
>>> is wrong with it.
>>>
>>> Thank you, I am glad that I asked for a review :-)
>>>
>>> Dave
>>>
>>>
>>> On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
>>>> Hi Dave,
>>>>
>>>> I guess this is a VM related issue. On 64-bit Spur
>>>> [ 7432154326465436 digitCompare: 8432154326465436 ] bench.
>>>> returns '5,420,000 per second. 184 nanoseconds per run.'.
>>>>
>>>> Removing the primitive call from #digitCompare: the number goes up:
>>>> '20,200,000 per second. 49.4 nanoseconds per run.'.
>>>>
>>>> You might think that it's okay because the JIT is so good. But we have
>>>> another primitive to compare two bytes-objects. One which ought be
>>>> slower than #primDigitCompare:, because it maps the bytes before doing
>>>> the
>>>> comparison. I even subtracted two from its result to match the result of
>>>> #digitCompare:
>>>>
>>>> | order |
>>>> order := (0 to: 255) as: ByteArray.
>>>> [ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
>>>> order) - 2 ] bench.
>>>>
>>>> But it's still about twice as quick as #primDigitCompare:.
>>>> '9,590,000 per second. 104 nanoseconds per run.'.
>>>> So, something must be wrong with #primDigitCompare:.
>>>>
>>>> Levente
>>>>
>>>> On Wed, 13 Apr 2016, David T. Lewis wrote:
>>>>
>>>>> I would appreciate a review before moving this to trunk.
>>>>>
>>>>> Background:
>>>>>
>>>>> In UTCDateAndTime, most things are much faster than the trunk version.
>>>>> However, DataAndTime equality check did not improve for 32-bit images,
>>>>> so I ran it under AndreasSystemProfiler. Profiling showed that large
>>>>> integer equality checks spends time mostly in primDigitCompare, which
>>>>> is inefficient when only a simple byte comparison is needed.
>>>>>
>>>>> Here is the performance difference that I see on my system, 32-bit trunk
>>>>> Spur on Linux:
>>>>>
>>>>> | a b c d |
>>>>> a := 7432154326465436. "a big number"
>>>>> b := a + 1. "low order digit changed"
>>>>> c := 8432154326465436. "high order digit changed"
>>>>> d := 7432154026465436. "a digit in the middle changed"
>>>>>
>>>>> "Base performance in the trunk image"
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
>>>>>
>>>>> "Performance after adding LargePositiveInteger>>="
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
>>>>> Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 13, 2016 at 10:57:28PM +0000, [hidden email]
>>>>> wrote:
>>>>>> David T. Lewis uploaded a new version of Kernel to project The Inbox:
>>>>>> http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
>>>>>>
>>>>>> ==================== Summary ====================
>>>>>>
>>>>>> Name: Kernel-dtl.1015
>>>>>> Author: dtl
>>>>>> Time: 13 April 2016, 6:57:22.56608 pm
>>>>>> UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
>>>>>> Ancestors: Kernel-mt.1014
>>>>>>
>>>>>> Make large integer equality test be about 13 times faster. Implement #=
>>>>>> in LargePositiveInteger, and use digitAt: (primitive 60) for the
>>>>>> comparison.
>>>>>>
>>>>>> =============== Diff against Kernel-mt.1014 ===============
>>>>>>
>>>>>> Item was added:
>>>>>> + ----- Method: LargePositiveInteger>>= (in category 'comparing') -----
>>>>>> + = aNumber
>>>>>> +
>>>>>> + aNumber class == self class ifTrue: [
>>>>>> + aNumber size = self size ifFalse: [ ^false ].
>>>>>> + self size to: 1 by: -1 do: [ :i | (aNumber digitAt:
>>>>>> i) =
>>>>>> (self digitAt: i) ifFalse: [ ^ false ] ].
>>>>>> + ^ true ].
>>>>>> + aNumber isInteger ifTrue: [ ^false ].
>>>>>> + aNumber isNumber ifFalse: [ ^false ].
>>>>>> + ^aNumber adaptToInteger: self andCompare: #=!
>>>>>>
>>>>>
>>>>>
>>
>>
>
>

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

David T. Lewis
 
It does seem odd that it would be different on Spur. You suggested that
compiler optimization might be set differently for the two plugins. I
cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some
kind of autotools problem that I can't figure out right now), but if you
are able to do a VM compile and save the console output to a file, then
I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm,
but first comment out the part of mvm that asks if you want to clean).

Aside from that, the actual generated code for primDigitCompare is
doing quite a lot of stuff before it ever gets around to comparing the
digits. I suspect it is not a very efficient primitive.

Dave


On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:

>
> This thread got derailed, and I feel like we should go back to its
> main point, which was that #digitCompare: using #primitiveDigitCompare is
> slow.
> It's so slow, that #primitiveCompareString is about twice as quick in
> Spur VMs, which is odd.
>
> Levente

RSqueakVM (was: primitiveDigitCompare is slow)

David T. Lewis
In reply to this post by timfelgentreff
 
(changing the subject line)

Thanks very much for this overview of RSqueakVM!

Dave


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
In reply to this post by David T. Lewis
 
Hi,
the primitive does nothing very special:
1) check the SmallInteger cases first, for a quick return
2) then, if both receiver & argument are large, compare the LargeInteger lengths
3) then, if both have the same length, compare the LargeInteger digits

None of these 3 steps is expected to be slow.
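To make the point concrete, the 3rd step is just a most-significant-digit-first byte loop. Here is a minimal C sketch of it, assuming 8-bit limbs stored least significant first; the names are illustrative and are not the LargeIntegers plugin's actual code.

```c
#include <stddef.h>

/* Compare the digits (limbs) of two equally long large integers,
 * starting from the most significant digit; the first difference
 * decides the result. Returns -1, 0 or 1, like #digitCompare:. */
static int digit_compare(const unsigned char *a, const unsigned char *b,
                         size_t len)
{
    for (size_t i = len; i > 0; i--) {   /* walk from the high digit down */
        unsigned char da = a[i - 1], db = b[i - 1];
        if (da != db)
            return da < db ? -1 : 1;
    }
    return 0;                            /* all digits equal */
}
```

The loop body is a handful of instructions per digit, which supports the suspicion that the time goes into primitive invocation and setup rather than the comparison itself.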

A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but that does not change the performance ratio; it could only make a difference for giant integers, and the integers used here fit in 56 bits...
I got these results with a 32-bit Cog Spur VM on Mac OS X:

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
 '8,980,000 per second. 111 nanoseconds per run.'

| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
 '16,400,000 per second. 60.8 nanoseconds per run.'

Obviously, most time is spent elsewhere, it would be interesting to know where exactly.

Nicolas

2016-04-16 18:52 GMT+02:00 David T. Lewis <[hidden email]>:

It does seem odd that it would be different on Spur. You suggested that
compiler optimization might be set differently for the two plugins. I
cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some
kind of autotools problem that I can't figure out right now), but if you
are able to do a VM compile and save the console output to a file, then
I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm,
but first comment out the part of mvm that asks if you want to clean).

Aside from that, the actual generated code for primDigitCompare is
doing quite a lot of stuff before it ever gets around to comparing the
digits. I suspect it is not a very efficient primitive.

Dave


On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
>
> This thread got derailed, and I feel like we should go back to its
> main point, which was that #digitCompare: using #primitiveDigitCompare is
> slow.
> It's so slow, that #primitiveCompareString is about twice as quick in
> Spur VMs, which is odd.
>
> Levente
>
> On Thu, 14 Apr 2016, Levente Uzonyi wrote:
>
> >Hi Dave,
> >
> >I dag a bit deeper into this problem, and I found that it's been around
> >for ages. I compared the two primitives in three older images.
> >The first one is Squeak 4.2 running on Cog r2714:
> >
> >[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> >'6,880,000 per second.'
> >
> >| order |
> >order := (0 to: 255) as: ByteArray.
> >[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> >order) - 2 ] bench.
> >'11,200,000 per second.'
> >
> >The next one was an even older Cog VM running an image updated from
> >3.10.2. The VM didn't have the primitives required to fetch the VM
> >information. The result were '8.22459348130374e6 per second.' and
> >'1.19677724910036e7 per second.', respectively.
> >
> >The last image was a closure-enabled 3.10.2 running on the classic
> >Interpreter VM. The results were '3.911442911417717e6 per second.' and
> >'4.84567866426715e6 per second.', respectively.
> >
> >Since in all VMs the seemingly more complex code (primitiveCompareString
> >with a subtraction) was quicker than the simpler code
> >(primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled
> >with less aggressive optimization than MiscPrimitivePlugin.
> >
> >What's also interesting is that, based on these benchmarks, the
> >performance got worse over the years. I think invoking a primitive has its
> >cost in newer VMs.
> >
> >Levente
> >
> >On Thu, 14 Apr 2016, David T. Lewis wrote:
> >
> >>Oops, I just realized that my "interpreter VM" results were from a
> >>debugging
> >>VM with compiler optimization off (too many VMs, sorry). Here are the
> >>results
> >>for a more representative interpreter VM. I can't explain the variation,
> >>but
> >>one conclusion is clear: my suggested "optimization" in the inbox is a
> >>bad idea
> >>(and I will move it to the treated inbox).
> >>
> >> "Base performance in the trunk level V3 image with interpreter VM"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"
> >>
> >> "Performance after adding LargePositiveInteger>>="
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]].  "==> 27824"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
> >>
> >>Dave
> >>
> >>
> >>
> >>On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
> >>>Hi Levente,
> >>>
> >>>I think you may be right. I repeated my test with an interpreter VM on a
> >>>32-bit image (with loop count smaller because the interpreter VM is
> >>>slower
> >>>than Spur). The change that I put in the inbox does not have any benefit
> >>>for the interpreter VM:
> >>>
> >>>
> >>>  a := 7432154326465436. "a big number"
> >>>  b := a + 1. "low order digit changed"
> >>>  c := 8432154326465436. "high order digit changed"
> >>>  d := 7432154026465436. "a digit in the middle changed"
> >>>
> >>>   "Base performance in the trunk level V3 image with interpreter VM"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"
> >>>
> >>>  "Performance after adding LargePositiveInteger>>="
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
> >>>
> >>>So yes it is something related to the VM. But I do not understand
> >>>how #primDigitCompare could be so slow? As you say, maybe something
> >>>is wrong with it.
> >>>
> >>>Thank you, I am glad that I asked for a review :-)
> >>>
> >>>Dave
> >>>
> >>>
> >>>On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
> >>>>Hi Dave,
> >>>>
> >>>>I guess this is a VM related issue. On 64-bit Spur
> >>>>[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> >>>>returns '5,420,000 per second. 184 nanoseconds per run.'.
> >>>>
> >>>>Removing the primitive call from #digitCompare: the number goes up:
> >>>>'20,200,000 per second. 49.4 nanoseconds per run.'.
> >>>>
> >>>>You might think that it's okay because the JIT is so good. But we have
> >>>>another primitive to compare two bytes-objects, one which ought to be
> >>>>slower than #primDigitCompare:, because it maps the bytes before doing
> >>>>the
> >>>>comparison. I even subtracted two from its result to match the result of
> >>>>#digitCompare:
> >>>>
> >>>>| order |
> >>>>order := (0 to: 255) as: ByteArray.
> >>>>[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> >>>>order) - 2 ] bench.
> >>>>
> >>>>But it's still about twice as quick as #primDigitCompare:.
> >>>>'9,590,000 per second. 104 nanoseconds per run.'.
> >>>>So, something must be wrong with #primDigitCompare:.
> >>>>
> >>>>Levente
> >>>>
> >>>>On Wed, 13 Apr 2016, David T. Lewis wrote:
> >>>>
> >>>>>I would appreciate a review before moving this to trunk.
> >>>>>
> >>>>>Background:
> >>>>>
> >>>>>In UTCDateAndTime, most things are much faster than the trunk version.
> >>>>>However, the DateAndTime equality check did not improve for 32-bit images,
> >>>>>so I ran it under AndreasSystemProfiler. Profiling showed that large
> >>>>>integer equality checks spend most of their time in primDigitCompare, which
> >>>>>is inefficient when only a simple byte comparison is needed.
> >>>>>
> >>>>>Here is the performance difference that I see on my system, 32-bit
> >>>>>trunk
> >>>>>Spur on Linux:
> >>>>>
> >>>>>| a b c d |
> >>>>>a := 7432154326465436. "a big number"
> >>>>>b := a + 1. "low order digit changed"
> >>>>>c := 8432154326465436. "high order digit changed"
> >>>>>d := 7432154026465436. "a digit in the middle changed"
> >>>>>
> >>>>>"Base performance in the trunk image"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
> >>>>>
> >>>>>"Performance after adding LargePositiveInteger>>="
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
> >>>>>
> >>>>>Dave
> >>>>>
> >>>>>
> >>>>>
> >>>>>On Wed, Apr 13, 2016 at 10:57:28PM +0000, [hidden email]
> >>>>>wrote:
> >>>>>>David T. Lewis uploaded a new version of Kernel to project The Inbox:
> >>>>>>http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
> >>>>>>
> >>>>>>==================== Summary ====================
> >>>>>>
> >>>>>>Name: Kernel-dtl.1015
> >>>>>>Author: dtl
> >>>>>>Time: 13 April 2016, 6:57:22.56608 pm
> >>>>>>UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
> >>>>>>Ancestors: Kernel-mt.1014
> >>>>>>
> >>>>>>Make large integer equality test be about 13 times faster. Implement
> >>>>>>#=
> >>>>>>in LargePositiveInteger, and use digitAt: (primitive 60) for the
> >>>>>>comparison.
> >>>>>>
> >>>>>>=============== Diff against Kernel-mt.1014 ===============
> >>>>>>
> >>>>>>Item was added:
> >>>>>>+ ----- Method: LargePositiveInteger>>= (in category 'comparing')
> >>>>>>-----
> >>>>>>+ = aNumber
> >>>>>>+
> >>>>>>+       aNumber class == self class ifTrue: [
> >>>>>>+               aNumber size = self size ifFalse: [ ^false ].
> >>>>>>+               self size to: 1 by: -1 do: [ :i | (aNumber digitAt:
> >>>>>>i) =
> >>>>>>(self digitAt: i) ifFalse: [ ^ false ] ].
> >>>>>>+               ^ true ].
> >>>>>>+       aNumber isInteger ifTrue: [ ^false ].
> >>>>>>+       aNumber isNumber ifFalse: [ ^false ].
> >>>>>>+       ^aNumber adaptToInteger: self andCompare: #=!
> >>>>>>
> >>>>>
> >>>>>
> >>
> >>
> >
> >


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
 
Ah but maybe the "Smart" side of the plugin is causing trouble:

    firstInteger := self
                primitive: 'primDigitCompare'
                parameters: #(#Integer )
                receiver: #Integer.

translates as:

        success(isKindOf(stackValue(0), "Integer"));
        secondInteger = stackValue(0);
        /* missing DebugCode */;
        success(isKindOf(stackValue(1), "Integer"));
        firstInteger = stackValue(1);

It might be faster to just check the 3 cases:

(interpreterProxy isIntegerObject: oop) or:
    [oopClass := interpreterProxy fetchClassOf: oop.
     oopClass == interpreterProxy classLargeNegativeInteger or: [oopClass == interpreterProxy classLargePositiveInteger]].

Moreover, we already test isIntegerObject further down in the primitive code.
Since every LargeIntegersPlugin primitive goes through this isKindOf check, there may be some low-hanging fruit here.
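To make the shape of that suggested check concrete, here is a small self-contained C sketch. The tag encoding and class constants below are invented stand-ins for illustration only; they are not the real interpreterProxy API or oop format:

```c
#include <assert.h>

typedef long sqInt;

/* Invented toy encoding: odd oops are immediate SmallIntegers; even oops
   carry a class id in their upper bits. Real Squeak oops are not like this. */
enum { ClassLargePositiveInteger = 2,
       ClassLargeNegativeInteger = 3,
       ClassOther                = 4 };

static int   isIntegerObject(sqInt oop) { return (int)(oop & 1); }
static sqInt fetchClassOf(sqInt oop)    { return oop >> 1; }

/* The three-case test: one tag check plus at most two class comparisons,
   instead of a generic isKindOf: walk up the superclass chain. */
static int isAnyInteger(sqInt oop) {
    sqInt oopClass;
    if (isIntegerObject(oop))
        return 1;                        /* SmallInteger: single bit test */
    oopClass = fetchClassOf(oop);
    return oopClass == ClassLargeNegativeInteger
        || oopClass == ClassLargePositiveInteger;
}
```

The point of the ordering is that the common SmallInteger case costs a single bit test, and the large-integer cases cost at most two pointer comparisons, with no superclass walk at all.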

2016-04-16 19:37 GMT+02:00 Nicolas Cellier <[hidden email]>:
Hi,
the primitive does nothing very special:
1) check for SmallInteger cases first, for a quick return
2) then check the LargeInteger lengths, if both receiver & argument are Large
3) then compare the LargeInteger digits, if both have the same length

None of these 3 steps is expected to be slow.

A bleeding edge VM does the 3rd step using 32-bit limbs instead of 8-bit ones, but this does not change the performance ratio; it could only make a difference for giant integers, and these ones fit in 56 bits...
I got these with 32-bit Cog Spur on MacOSX:

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
 '8,980,000 per second. 111 nanoseconds per run.'

| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
 '16,400,000 per second. 60.8 nanoseconds per run.'

Obviously, most time is spent elsewhere, it would be interesting to know where exactly.

Nicolas


2016-04-16 18:52 GMT+02:00 David T. Lewis <[hidden email]>:

It does seem odd that it would be different on Spur. You suggested that
compiler optimization might be set differently for the two plugins. I
cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some
kind of autotools problem that I can't figure out right now), but if you
are able to do a VM compile and save the console output to a file, then
I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm,
but first comment out the part of mvm that asks if you want to clean).

Aside from that, the actual generated code for primDigitCompare is
doing quite a lot of stuff before it ever gets around to comparing the
digits. I suspect it is not a very efficient primitive.

Dave


On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
>
> This thread got derailed, and I feel like we should go back to its
> main point, which was that #digitCompare: using #primitiveDigitCompare is
> slow.
> It's so slow that #primitiveCompareString is about twice as quick in
> Spur VMs, which is odd.
>
> Levente
>
> On Thu, 14 Apr 2016, Levente Uzonyi wrote:



Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
 
Bingo,
here are the results without isKindOf checks:

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
 '16,100,000 per second. 62.1 nanoseconds per run.'

that is more or less the speed of ByteArray comparison.

I think I will make a new pass on LargeIntegersPlugin :)
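With the type checks out of the way, what remains of the primitive is essentially step 3 of the earlier message: a most-significant-byte-first scan of the digit arrays. The loop below is a minimal sketch of what that step reduces to, an assumption for illustration rather than the plugin's actual code:

```c
#include <stddef.h>

/* Compare two little-endian digit (byte) arrays by magnitude, as
   #digitCompare: does: return -1, 0 or 1 as a is smaller than, equal to,
   or larger than b. A shorter digit array denotes the smaller magnitude. */
static int digitCompare(const unsigned char *a, size_t lenA,
                        const unsigned char *b, size_t lenB) {
    if (lenA != lenB)
        return lenA < lenB ? -1 : 1;
    for (size_t i = lenA; i-- > 0; ) {   /* start at the high-order digit */
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```

For the 7-byte integers used in the benchmarks, this loop usually exits on the first or last byte examined, which is why any fixed per-call checking overhead dominates the measured time.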

2016-04-16 22:54 GMT+02:00 Nicolas Cellier <[hidden email]>:




Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Eliot Miranda-2
 
Hi Nicolas,

   IMO we need a different API for checking classes, something that, via macros, compiles to direct class access for internal plugins.

I was also going to suggest using the VM Profiler:

"The VMProfiler will tell you exactly where:
then screen menu open... VM Profiler
then put an expression in the bottom right hand box, and hit  Profile,
then in the left-hand list of function names, select VM report"

But the UnixOSProcessPlugin as compiled by my mac build scripts and run under Spur on Mac OS X does not appear to load.  The VM Profiler depends on OS Process to extract addresses from shared libraries.  I'm looking at it now :-(


On Sat, Apr 16, 2016 at 2:44 PM, Nicolas Cellier <[hidden email]> wrote:
 
Bingo,
here are the results without isKindOf checks:

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
 '16,100,000 per second. 62.1 nanoseconds per run.'

that is more or less the speed of ByteArray comparison.

I think I will make a new pass on LargeIntegersPlugin :)

2016-04-16 22:54 GMT+02:00 Nicolas Cellier <[hidden email]>:
Ah but maybe the "Smart" side of the plugin is causing trouble:

    firstInteger := self
                primitive: 'primDigitCompare'
                parameters: #(#Integer )
                receiver: #Integer.

translates as:

        success(isKindOf(stackValue(0), "Integer"));
        secondInteger = stackValue(0);
        /* missing DebugCode */;
        success(isKindOf(stackValue(1), "Integer"));
        firstInteger = stackValue(1);

It might be faster to just check the 3 cases:

(interpreterProxy isIntegerObject: oop) or:
    [oopClass := interpreterProxy fetchClassOf: oop.
     oopClass == interpreterProxy classLargeNegativeInteger or: [oopClass == interpreterProxy classLargePositiveInteger]].

Moreover, we already test isIntegerObject further in the primitive code.
Since every LargeIntegersPlugin primitive is going thru this isKindOf check, there might be some low hanging fruits.

2016-04-16 19:37 GMT+02:00 Nicolas Cellier <[hidden email]>:
Hi,
the primitive does nothing very special
1) check for SmallInteger cases first, for quick return
2) check for LargeIntegers length then if both receiver & argument are Large
3) check for LargeIntegers digits then if both have same length

None of these 3 steps is expected to be slow.

A bleeding age VM does the 3rd step using 32 bits limbs instead of 8bits but this does not change a thing about performances ratio, it could only make a difference for giant integers, these ones fit on 56 bits...
I got these with 32bits Cog Sput MacOSX :

[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
 '8,980,000 per second. 111 nanoseconds per run.'

| order |
order := (0 to: 255) as: ByteArray.
[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated: order) - 2 ] bench.
 '16,400,000 per second. 60.8 nanoseconds per run.'

Obviously, most time is spent elsewhere, it would be interesting to know where exactly.

Nicolas


2016-04-16 18:52 GMT+02:00 David T. Lewis <[hidden email]>:

It does seem odd that it would be different on Spur. You suggested that
compiler optimization might be set differently for the two plugins. I
cannot check (I'm on an Ubuntu laptop, cannot build Spur due to some
kind of autotools problem that I can't figure out right now), but if you
are able to do a VM compile and save the console output to a file, then
I think you can look at the CFLAGS setting for the two plugins ($ nohup ./mvm,
but first comment out the part of mvm that asks if you want to clean).

Aside from that, the actual generated code for primDigitCompare is
doing quite a lot of stuff before it ever gets around to comparing the
digits. I suspect it is not a very efficient primitive.

Dave


On Fri, Apr 15, 2016 at 05:52:02PM +0200, Levente Uzonyi wrote:
>
> This thread got derailed, and I feel like we should go back to its
> main point, which was that #digitCompare: using #primitiveDigitCompare is
> slow.
> It's so slow that #primitiveCompareString is about twice as quick in
> Spur VMs, which is odd.
>
> Levente
>
> On Thu, 14 Apr 2016, Levente Uzonyi wrote:
>
> >Hi Dave,
> >
> >I dug a bit deeper into this problem, and I found that it's been around
> >for ages. I compared the two primitives in three older images.
> >The first one is Squeak 4.2 running on Cog r2714:
> >
> >[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> >'6,880,000 per second.'
> >
> >| order |
> >order := (0 to: 255) as: ByteArray.
> >[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> >order) - 2 ] bench.
> >'11,200,000 per second.'
> >
> >The next one was an even older Cog VM running an image updated from
> >3.10.2. The VM didn't have the primitives required to fetch the VM
> >information. The results were '8.22459348130374e6 per second.' and
> >'1.19677724910036e7 per second.', respectively.
> >
> >The last image was a closure-enabled 3.10.2 running on the classic
> >Interpreter VM. The results were '3.911442911417717e6 per second.' and
> >'4.84567866426715e6 per second.', respectively.
> >
> >Since in all VMs the seemingly more complex code (primitiveCompareString
> >with a subtraction) was quicker than the simpler code
> >(primitiveDigitCompare), I suspect that LargeIntegersPlugin is compiled
> >with less aggressive optimization than MiscPrimitivePlugin.
> >
> >What's also interesting is that, based on these benchmarks, the
> >performance got worse over the years. I think invoking a primitive has its
> >cost in newer VMs.
> >
> >Levente
> >
> >On Thu, 14 Apr 2016, David T. Lewis wrote:
> >
> >>Oops, I just realized that my "interpreter VM" results were from a
> >>debugging
> >>VM with compiler optimization off (too many VMs, sorry). Here are the
> >>results
> >>for a more representative interpreter VM. I can't explain the variation,
> >>but
> >>one conclusion is clear: my suggested "optimization" in the inbox is a
> >>bad idea
> >>(and I will move it to the treated inbox).
> >>
> >> "Base performance in the trunk level V3 image with interpreter VM"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 26669"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]]. "==> 25052"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 25275"
> >>
> >> "Performance after adding LargePositiveInteger>>="
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = b ]]. "==> 59224"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = c ]].  "==> 27824"
> >> Time millisecondsToRun: [100000000 timesRepeat: [ a = d ]]. "==> 44324"
> >>
> >>Dave
> >>
> >>
> >>
> >>On Wed, Apr 13, 2016 at 08:25:33PM -0400, David T. Lewis wrote:
> >>>Hi Levente,
> >>>
> >>>I think you may be right. I repeated my test with an interpreter VM on a
> >>>32-bit image (with loop count smaller because the interpreter VM is
> >>>slower
> >>>than Spur). The change that I put in the inbox does not have any benefit
> >>>for the interpreter VM:
> >>>
> >>>
> >>>  a := 7432154326465436. "a big number"
> >>>  b := a + 1. "low order digit changed"
> >>>  c := 8432154326465436. "high order digit changed"
> >>>  d := 7432154026465436. "a digit in the middle changed"
> >>>
> >>>   "Base performance in the trunk level V3 image with interpreter VM"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3844"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3786"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3800"
> >>>
> >>>  "Performance after adding LargePositiveInteger>>="
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = b ]]. "==> 3868"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = c ]]. "==> 3775"
> >>>  Time millisecondsToRun: [5000000 timesRepeat: [ a = d ]]. "==> 3770"
> >>>
> >>>So yes it is something related to the VM. But I do not understand
> >>>how #primDigitCompare could be so slow? As you say, maybe something
> >>>is wrong with it.
> >>>
> >>>Thank you, I am glad that I asked for a review :-)
> >>>
> >>>Dave
> >>>
> >>>
> >>>On Thu, Apr 14, 2016 at 01:45:42AM +0200, Levente Uzonyi wrote:
> >>>>Hi Dave,
> >>>>
> >>>>I guess this is a VM related issue. On 64-bit Spur
> >>>>[ 7432154326465436 digitCompare: 8432154326465436 ] bench.
> >>>>returns '5,420,000 per second. 184 nanoseconds per run.'.
> >>>>
> >>>>Removing the primitive call from #digitCompare: the number goes up:
> >>>>'20,200,000 per second. 49.4 nanoseconds per run.'.
> >>>>
> >>>>You might think that it's okay because the JIT is so good. But we have
> >>>>another primitive to compare two bytes-objects, one which ought to be
> >>>>slower than #primDigitCompare:, because it maps the bytes before doing
> >>>>the
> >>>>comparison. I even subtracted two from its result to match the result of
> >>>>#digitCompare:
> >>>>
> >>>>| order |
> >>>>order := (0 to: 255) as: ByteArray.
> >>>>[ (ByteString compare: 7432154326465436 with: 8432154326465436 collated:
> >>>>order) - 2 ] bench.
> >>>>
> >>>>But it's still about twice as quick as #primDigitCompare:.
> >>>>'9,590,000 per second. 104 nanoseconds per run.'.
> >>>>So, something must be wrong with #primDigitCompare:.
> >>>>
> >>>>Levente
> >>>>
> >>>>On Wed, 13 Apr 2016, David T. Lewis wrote:
> >>>>
> >>>>>I would appreciate a review before moving this to trunk.
> >>>>>
> >>>>>Background:
> >>>>>
> >>>>>In UTCDateAndTime, most things are much faster than the trunk version.
> >>>>>However, the DateAndTime equality check did not improve for 32-bit images,
> >>>>>so I ran it under AndreasSystemProfiler. Profiling showed that large
> >>>>>integer equality checks spend time mostly in primDigitCompare, which
> >>>>>is inefficient when only a simple byte comparison is needed.
> >>>>>
> >>>>>Here is the performance difference that I see on my system, 32-bit
> >>>>>trunk
> >>>>>Spur on Linux:
> >>>>>
> >>>>>| a b c d |
> >>>>>a := 7432154326465436. "a big number"
> >>>>>b := a + 1. "low order digit changed"
> >>>>>c := 8432154326465436. "high order digit changed"
> >>>>>d := 7432154026465436. "a digit in the middle changed"
> >>>>>
> >>>>>"Base performance in the trunk image"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 63733"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 63152"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 63581"
> >>>>>
> >>>>>"Performance after adding LargePositiveInteger>>="
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = b ]]. "==> 4676"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = c ]]. "==> 4883"
> >>>>>Time millisecondsToRun: [500000000 timesRepeat: [ a = d ]]. "==> 4512"
> >>>>>
> >>>>>Dave
> >>>>>
> >>>>>
> >>>>>
> >>>>>On Wed, Apr 13, 2016 at 10:57:28PM +0000, [hidden email]
> >>>>>wrote:
> >>>>>>David T. Lewis uploaded a new version of Kernel to project The Inbox:
> >>>>>>http://source.squeak.org/inbox/Kernel-dtl.1015.mcz
> >>>>>>
> >>>>>>==================== Summary ====================
> >>>>>>
> >>>>>>Name: Kernel-dtl.1015
> >>>>>>Author: dtl
> >>>>>>Time: 13 April 2016, 6:57:22.56608 pm
> >>>>>>UUID: bd849f91-9b00-45c5-b2ab-891b420bde5e
> >>>>>>Ancestors: Kernel-mt.1014
> >>>>>>
> >>>>>>Make large integer equality test be about 13 times faster. Implement
> >>>>>>#=
> >>>>>>in LargePositiveInteger, and use digitAt: (primitive 60) for the
> >>>>>>comparison.
> >>>>>>
> >>>>>>=============== Diff against Kernel-mt.1014 ===============
> >>>>>>
> >>>>>>Item was added:
> >>>>>>+ ----- Method: LargePositiveInteger>>= (in category 'comparing')
> >>>>>>-----
> >>>>>>+ = aNumber
> >>>>>>+
> >>>>>>+       aNumber class == self class ifTrue: [
> >>>>>>+               aNumber size = self size ifFalse: [ ^false ].
> >>>>>>+               self size to: 1 by: -1 do: [ :i | (aNumber digitAt:
> >>>>>>i) =
> >>>>>>(self digitAt: i) ifFalse: [ ^ false ] ].
> >>>>>>+               ^ true ].
> >>>>>>+       aNumber isInteger ifTrue: [ ^false ].
> >>>>>>+       aNumber isNumber ifFalse: [ ^false ].
> >>>>>>+       ^aNumber adaptToInteger: self andCompare: #=!
> >>>>>>
> >>>>>
> >>>>>
> >>
> >>
> >
> >







--
_,,,^..^,,,_
best, Eliot

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
In reply to this post by Nicolas Cellier
 
I just played around a bit with the code generator, and I think there are
other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject,
slotSizeOf, success, failed, stackValue. I suspect that the VM internally
uses macros for these in numbered primitives.
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail
call. That's probably a limitation of the Slang inliner, or a bug.

Some functions are generated as static, even though they are never called,
because they have been inlined during the code generation. E.g.:
#digitCompareLarge:with:.
While this doesn't have a direct effect on execution time, it may affect
the CPU cache.
I think it would be best not to generate any C code for such methods.
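The macro-vs-function difference can be made concrete with a sketch (the tag test assumes the 1-bit SmallInteger tag of 32-bit Squeak, and the names are illustrative; in a real plugin the function version is a call through the interpreterProxy struct, which the C compiler cannot inline across the boundary):

```c
typedef long sqInt;

/* Out-of-line version: what a plugin effectively pays for -- in the
   real VM this is a call through a function pointer in the
   interpreterProxy struct, opaque to the optimizer. */
static int isIntegerObjectFn(sqInt oop)
{
    return (oop & 1) != 0;
}

/* Macro version: what the numbered primitives inside the VM can use.
   Each use compiles down to a single AND/TEST on the oop, with no
   call overhead. */
#define IS_INTEGER_OBJECT(oop) (((oop) & 1) != 0)
```

Both compute the same predicate; only the call overhead and the optimizer's visibility differ.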

Levente

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
In reply to this post by Eliot Miranda-2
 


2016-04-16 23:54 GMT+02:00 Eliot Miranda <[hidden email]>:
 
Hi Nicolas,

   IMO we need a different API for checking classes, something that via macros compiles to direct class access for internal plugins.

I was also going to suggest using the VM Profiler:

"The VMProfiler will tell you exactly where:
then screen menu open... VM Profiler
then put an expression in the bottom right hand box, and hit  Profile,
then in the left-hand list of function names, select VM report"

But the UnixOSProcessPlugin as compiled by my mac build scripts and run under Spur on Mac OS X does not appear to load.  The VM Profiler depends on OS Process to extract addresses from shared libraries.  I'm looking at it now :-(


Great, one more tool to look at :)
 
--
_,,,^..^,,,_
best, Eliot



Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
In reply to this post by Levente Uzonyi
 
On Sun, 17 Apr 2016, Levente Uzonyi wrote:

>
> I just played around a bit with the code generator, and I think there are
> other reasons why the code is slow:
> - the use of function calls instead of macros for isIntegerObject,
> slotSizeOf, success, failed, stackValue. I suspect that the VM internally
> uses macros for these in numbered primitives.

Perhaps gcc can automatically inline these with the -flto switch.
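For instance, building the interpreter and an internal plugin with link-time optimization might look like this (an illustrative build fragment only, not the actual mvm/Makefile lines; the file names are indicative):

```shell
# Illustrative only: with -flto on both the compile and link steps,
# gcc can inline functions across translation units, e.g. the
# interpreterProxy accessors into the LargeIntegers primitives.
CFLAGS="-O2 -flto"
gcc $CFLAGS -c gcc3x-interp.c  -o interp.o
gcc $CFLAGS -c LargeIntegers.c -o LargeIntegers.o
gcc $CFLAGS -o squeak interp.o LargeIntegers.o
# plus the remaining objects and libraries of the real build
```

Note that -flto only helps internally linked plugins; calls into an external plugin loaded as a shared library still cannot be inlined.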

Levente

> - #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call.
> That's probably a limitation of the Slang inliner, or a bug.
>
> Some functions are generated as static, even though they are never called,
> because they have been inlined during the code generation. E.g.:
> #digitCompareLarge:with:.
> While this doesn't have a direct effect on execution time, it may affect the
> CPU cache.
> I think the best would be to not generate any C code from such methods.
>
> Levente
>

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
In reply to this post by Levente Uzonyi
 


2016-04-17 0:36 GMT+02:00 Levente Uzonyi <[hidden email]>:

I just played around a bit with the code generator, and I think there are other reasons why the code is slow:
- the use of function calls instead of macros for isIntegerObject, slotSizeOf, success, failed, stackValue. I suspect that the VM internally uses macros for these in numbered primitives.


Hi Levente,
But currently we generate a single code source for the plugins whatever the VM flavour, so the macro, if any, would have to be defined in the VM-specific repository... That sounds a bit like what Eliot was suggesting for fetching the class, though (if I understood correctly).

My opinion is that suppressing the function call overhead will make little difference, since these calls are pre-condition checks made once or twice per primitive.
 
- #cDigitCompare:with:len: doesn't get inlined, even though it's a tail call. That's probably a limitation of the Slang inliner, or a bug.


I'm working on Slang type inference and inlining (they are closely bound), but I don't think it really matters for this specific call. It matters more for unsafeDigitAt: & co., which might be used in tight loops.
 
Some functions are generated as static, even though they are never called, because they have been inlined during the code generation. E.g.: #digitCompareLarge:with:.
While this doesn't have a direct effect on execution time, it may affect the CPU cache.
I think the best would be to not generate any C code from such methods.


I wonder if I did not see code for eliminating these... I'll check again.
 
Levente
