Floating point performance


Floating point performance

David Faught
 John M McIntosh wrote:
>Could you share your messageTally. If you are using floatarray logic
>then most of the math is done in the plugin. However
>the plugin does not take advantage of any vector processing hardware
>you might have so there is room for improvement.

The MessageTally output is below.  Maybe "almost 80% of the time was
spent in basic floating point array operations" is a little
exaggerated, but not a lot.  What vector processing hardware?  The
only thing I know of would be trying to use the video card GPU, which
could be lots of fun!

>Also if you have say a+b*c-d in Smalltalk where these are float
>array objects, that would be three primitive interactions; converting
>that to Slang would provide some performance improvement.

I'm not sure I understand this statement.  Is there enough overhead in
the plugin API to justify eliminating a couple of calls, or is there
some data representation conversion involved that could be avoided?

I haven't read Andrew Greenberg's chapter on "Extending the Squeak
Virtual Machine" in detail yet.  I kind of skimmed over the sections
"The Shape of a Smalltalk Object" and "The Anatomy of a Named
Primitive", which I'm sure is where all the good stuff is.  Are you
saying that some performance improvement in your sample expression
could be gained by just coding it in Slang, without translating and
compiling it, or have I gone one step too far?


- 2441 tallies, 39083 msec.

**Tree**
100.0% {39083ms} TClothOxe>>pulse
  77.8% {30407ms} TClothOxe>>constrain
    |77.8% {30407ms} TClothOxe>>constrain:
    |  14.2% {5550ms} B3DVector3(FloatArray)>>*
    |  13.9% {5433ms} B3DVector3(FloatArray)>>-
    |  12.2% {4768ms} B3DVector3Array>>at:
    |  9.7% {3791ms} TClothOxe>>collide
    |    |9.7% {3791ms} TClothOxe>>collideSphere:
    |    |  3.6% {1407ms} B3DVector3(FloatArray)>>length
    |    |  3.0% {1172ms} B3DVector3(FloatArray)>>-
    |    |  2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    |    |    2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
    |  8.8% {3439ms} B3DVector3(FloatArray)>>+
    |  6.3% {2462ms} B3DVector3Array>>at:put:
    |  5.8% {2267ms} TClothOxe>>constrainGround
    |    |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
    |    |2.6% {1016ms} B3DVector3>>y
    |  3.8% {1485ms} OrderedCollection>>do:
    |  2.8% {1094ms} primitives
  7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
    |7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
    |  2.7% {1055ms} B3DVector3Array>>at:put:
    |  2.5% {977ms} B3DVector3Array>>at:
  4.4% {1720ms} Float>>*
    |2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
    |2.0% {782ms} primitives
  3.2% {1251ms} B3DVector3(FloatArray)>>-
  2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:

**Leaves**
20.1% {7856ms} B3DVector3(FloatArray)>>-
19.8% {7738ms} B3DVector3Array>>at:
15.9% {6214ms} B3DVector3(FloatArray)>>*
11.8% {4612ms} B3DVector3Array>>at:put:
10.9% {4260ms} B3DVector3(FloatArray)>>+
3.8% {1485ms} OrderedCollection>>do:
2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.8% {1094ms} TClothOxe>>constrain:
2.6% {1016ms} B3DVector3>>y
2.0% {782ms} Float>>*

**Memory**
        old +386,532 bytes
        young -551,924 bytes
        used -165,392 bytes
        free +165,392 bytes

**GCs**
        full 0 totalling 0ms (0.0% uptime)
        incr 7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
        tenures 1 (avg 7133 GCs/tenure)
        root table 0 overflows


Re: Floating point performance

Joshua Gargus-2

On Dec 13, 2006, at 4:33 PM, David Faught wrote:

> John M McIntosh wrote:
>> Could you share your messageTally. If you are using floatarray logic
>> then most of the math is done in the plugin. However
>> the plugin does not take advantage of any vector processing hardware
>> you might have so there is room for improvement.
>
> The MessageTally output is below.  Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.
> What vector processing hardware?

Like SSE or MMX on Intel, or Altivec on PowerPC.
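To make that concrete: SIMD units add several floats in one instruction, which is exactly what the FloatArray plugin currently doesn't exploit. A minimal C sketch of a 4-wide add using SSE intrinsics (the `add4` helper is hypothetical, x86-only, and just illustrates the idea):

```c
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

/* Add two 4-element float vectors with a single SSE add instruction,
 * instead of four scalar adds. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 adds at once */
}
```

Altivec on PowerPC offers the equivalent via `vec_add`; the plugin would need per-architecture code paths to use either.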

> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!
>
>> Also if you have say a+b*c-d in Smalltalk where these are float
>> array objects, that would be three primitive interactions; converting
>> that to Slang would provide some performance improvement.
>
> I'm not sure I understand this statement.  Is there enough overhead in
> the plugin API to justify eliminating a couple of calls, or is there
> some data representation conversion involved that could be avoided?
>
> I haven't read Andrew Greenberg's chapter on "Extending the Squeak
> Virtual Machine" in detail yet.  I kind of skimmed over the sections
> "The Shape of a Smalltalk Object" and "The Anatomy of a Named
> Primitive", which I'm sure is where all the good stuff is.  Are you
> saying that some performance improvement in your sample expression
> could be gained by just coding it in Slang, without translating and
> compiling it, or have I gone one step too far?
>

I see about 43% on float array arithmetic and another 4% on regular  
float arithmetic.  The float array arithmetic is being done on very  
short arrays (3 elements), so the call overhead might be significant  
(i.e. John's suggestion re: a+b*c-d might pay dividends); if you're  
dealing with 10000-element float arrays, then the call overhead is  
negligible.

The big problem seems to be that you're spending a lot of time  
unpacking and repacking B3DVector3Arrays.  If you write a primitive  
to do the computation, then you can avoid all of the #at: and  
#at:put: overhead, the overhead for allocating and garbage collecting  
all of the intermediate B3DVectors, and much of the call overhead for  
*, +, and -.
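To illustrate John's a+b*c-d point: Smalltalk binary messages evaluate left to right, so a+b*c-d parses as ((a+b)*c)-d and today costs three primitive calls plus two intermediate FloatArray allocations. A fused primitive does it in one pass; a rough C sketch of what its inner loop might look like (the name and signature are made up for illustration, not actual Slang output):

```c
#include <stddef.h>

/* One fused pass computing the Smalltalk expression a + b * c - d.
 * Binary messages evaluate left to right, so this is ((a + b) * c) - d
 * per element.  One primitive call replaces three, and no intermediate
 * FloatArrays are allocated or garbage collected. */
static void fusedAddMulSub(float *r, const float *a, const float *b,
                           const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = (a[i] + b[i]) * c[i] - d[i];
}
```

For 3-element vectors the saving is mostly the avoided call and allocation overhead rather than the arithmetic itself.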

The following experiment might be helpful:
a := Vector3Array new: 1000.
b := Vector3 new.
[1000 timesRepeat: [a + a]] timeToRun. "18ms"
[1000 timesRepeat: [a += a]] timeToRun. "5ms"
[1000000 timesRepeat: [b + b]] timeToRun. "481ms"
[1000000 timesRepeat: [b += b]] timeToRun. "258ms"

In all cases we're adding a million pairs of vect3s.  The first and  
third are slower than the second and fourth due to the allocation of  
a new target array each time; this gives you an idea of the  
achievable gains if you can restructure your algorithms to re-use an  
intermediate target array.  The first two are faster than the last  
two because they're not doing as much work in Squeak: fewer iterations  
and fewer primitive calls.  Note that these last two still don't  
involve unpacking and packing arrays.  Your code seems to be doing  
something like:

[1000 timesRepeat: [a do: [:aa | aa + aa]]] timeToRun.  "1346ms"
or perhaps
[1000 timesRepeat: [1 to: 1000 do: [:i | (a at: i) + (a at: i)]]] timeToRun. "1286ms"
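The same allocate-versus-reuse effect Josh measured exists outside Squeak too; a rough C analogue (hypothetical helper names) of why the in-place variants win:

```c
#include <stdlib.h>

/* Allocating variant: like `a + a`, every call creates a fresh result
 * array that must later be freed (in Squeak, garbage collected). */
static float *addAlloc(const float *a, const float *b, size_t n)
{
    float *r = malloc(n * sizeof *r);
    for (size_t i = 0; i < n; i++)
        r[i] = a[i] + b[i];
    return r;
}

/* In-place variant: like `a += a`, reuses the receiver as the target,
 * so the loop is the only cost. */
static void addInPlace(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

In Squeak the gap is wider still, because each discarded result also adds incremental-GC work (note the 7133 incremental GCs in the tally).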

In short, it looks like there is a lot of room for improvement.

Josh



Re: Floating point performance

news.gmane.org-2
In reply to this post by David Faught
David Faught <dave.faught <at> gmail.com> writes:


> The MessageTally output is below.  Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.  What vector processing hardware?  

If you link against the Intel MKL (Math Kernel Library), at least on Intel
chips, I believe you will take advantage of any special instructions. MKL
implements the same API as LAPACK and BLAS.
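For reference, a typical BLAS Level 1 routine is saxpy, y ← αx + y. MKL (like OpenBLAS or Apple's Accelerate) exports it as `cblas_saxpy` with SIMD-optimized code behind it; the plain loop below is just a stand-in showing the semantics a plugin would call out to:

```c
/* Reference version of the BLAS saxpy operation: y <- alpha*x + y.
 * A real plugin would instead call cblas_saxpy(n, alpha, x, 1, y, 1)
 * from the vendor library; this loop only demonstrates the contract. */
static void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

The incX/incY stride arguments of the real `cblas_saxpy` are omitted here (assumed 1, i.e. contiguous arrays).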

> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!

BTW, has anybody ported Squeak to a GPU? Have you seen how Folding@home is
leveraging GPUs?

http://folding.stanford.edu/FAQ-ATI.html