Floating point performance


Floating point performance

David Faught
 John M McIntosh wrote:
>Could you share your messageTally. If you are using floatarray logic
>then most of the math is done in the plugin. However
>the plugin does not take advantage of any vector processing hardware
>you might have so there is room for improvement.

The MessageTally output is below.  Maybe "almost 80% of the time was
spent in basic floating point array operations" is a little
exaggerated, but not a lot.  What vector processing hardware?  The
only thing I know of would be trying to use the video card GPU, which
could be lots of fun!

>Also if you have say a+b*c-d in Smalltalk where these are float
>array objects, that would be three primitive interactions; converting
>that to Slang would provide some performance improvement.

I'm not sure I understand this statement.  Is there enough overhead in
the plugin API to justify eliminating a couple of calls, or is there
some data representation conversion involved that could be avoided?

I haven't read Andrew Greenberg's chapter on "Extending the Squeak
Virtual Machine" in detail yet.  I kind of skimmed over the sections
"The Shape of a Smalltalk Object" and "The Anatomy of a Named
Primitive", which I'm sure is where all the good stuff is.  Are you
saying that some performance improvement in your sample expression
could be gained by just coding it in Slang, without translating and
compiling it, or have I gone one step too far?


- 2441 tallies, 39083 msec.

**Tree**
100.0% {39083ms} TClothOxe>>pulse
  77.8% {30407ms} TClothOxe>>constrain
    |77.8% {30407ms} TClothOxe>>constrain:
    |  14.2% {5550ms} B3DVector3(FloatArray)>>*
    |  13.9% {5433ms} B3DVector3(FloatArray)>>-
    |  12.2% {4768ms} B3DVector3Array>>at:
    |  9.7% {3791ms} TClothOxe>>collide
    |    |9.7% {3791ms} TClothOxe>>collideSphere:
    |    |  3.6% {1407ms} B3DVector3(FloatArray)>>length
    |    |  3.0% {1172ms} B3DVector3(FloatArray)>>-
    |    |  2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    |    |    2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
    |  8.8% {3439ms} B3DVector3(FloatArray)>>+
    |  6.3% {2462ms} B3DVector3Array>>at:put:
    |  5.8% {2267ms} TClothOxe>>constrainGround
    |    |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
    |    |2.6% {1016ms} B3DVector3>>y
    |  3.8% {1485ms} OrderedCollection>>do:
    |  2.8% {1094ms} primitives
  7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
    |7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
    |  2.7% {1055ms} B3DVector3Array>>at:put:
    |  2.5% {977ms} B3DVector3Array>>at:
  4.4% {1720ms} Float>>*
    |2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
    |2.0% {782ms} primitives
  3.2% {1251ms} B3DVector3(FloatArray)>>-
  2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:

**Leaves**
20.1% {7856ms} B3DVector3(FloatArray)>>-
19.8% {7738ms} B3DVector3Array>>at:
15.9% {6214ms} B3DVector3(FloatArray)>>*
11.8% {4612ms} B3DVector3Array>>at:put:
10.9% {4260ms} B3DVector3(FloatArray)>>+
3.8% {1485ms} OrderedCollection>>do:
2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.8% {1094ms} TClothOxe>>constrain:
2.6% {1016ms} B3DVector3>>y
2.0% {782ms} Float>>*

**Memory**
        old +386,532 bytes
        young -551,924 bytes
        used -165,392 bytes
        free +165,392 bytes

**GCs**
        full 0 totalling 0ms (0.0% uptime)
        incr 7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
        tenures 1 (avg 7133 GCs/tenure)
        root table 0 overflows


Re: Floating point performance

Joshua Gargus-2

On Dec 13, 2006, at 4:33 PM, David Faught wrote:

> John M McIntosh wrote:
>> Could you share your messageTally. If you are using floatarray logic
>> then most of the math is done in the plugin. However
>> the plugin does not take advantage of any vector processing hardware
>> you might have so there is room for improvement.
>
> The MessageTally output is below.  Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.
> What vector processing hardware?

Like SSE or MMX on Intel, or Altivec on PowerPC.
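To make that concrete: SIMD units add several floats in one instruction, which is exactly what the FloatArray plugin currently doesn't exploit. A minimal C sketch of a 4-wide add using SSE intrinsics (the `add4` helper is hypothetical, x86-only, and just illustrates the idea):

```c
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

/* Add two 4-element float vectors with a single SSE add instruction,
 * instead of four scalar adds. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 adds at once */
}
```

Altivec on PowerPC offers the equivalent via `vec_add`; the plugin would need per-architecture code paths to use either.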

> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!
>
>> Also if you have say a+b*c-d in Smalltalk where these are float
>> array objects, that would be three primitive interactions; converting
>> that to Slang would provide some performance improvement.
>
> I'm not sure I understand this statement.  Is there enough overhead in
> the plugin API to justify eliminating a couple of calls, or is there
> some data representation conversion involved that could be avoided?
>
> I haven't read Andrew Greenberg's chapter on "Extending the Squeak
> Virtual Machine" in detail yet.  I kind of skimmed over the sections
> "The Shape of a Smalltalk Object" and "The Anatomy of a Named
> Primitive", which I'm sure is where all the good stuff is.  Are you
> saying that some performance improvement in your sample expression
> could be gained by just coding it in Slang, without translating and
> compiling it, or have I gone one step too far?
>

I see about 43% on float array arithmetic and another 4% on regular  
float arithmetic.  The float array arithmetic is being done on very  
short arrays (3 elements), so the call overhead might be significant  
(i.e. John's suggestion re: a+b*c-d might pay dividends); if you're  
dealing with 10000-element float arrays, then the call overhead is  
negligible.

The big problem seems to be that you're spending a lot of time  
unpacking and repacking B3DVector3Arrays.  If you write a primitive  
to do the computation, then you can avoid all of the #at: and  
#at:put: overhead, the overhead for allocating and garbage collecting  
all of the intermediate B3DVectors, and much of the call overhead for  
*, +, and -.
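To illustrate John's a+b*c-d point: Smalltalk binary messages evaluate left to right, so a+b*c-d parses as ((a+b)*c)-d and today costs three primitive calls plus two intermediate FloatArray allocations. A fused primitive does it in one pass; a rough C sketch of what its inner loop might look like (the name and signature are made up for illustration, not actual Slang output):

```c
#include <stddef.h>

/* One fused pass computing the Smalltalk expression a + b * c - d.
 * Binary messages evaluate left to right, so this is ((a + b) * c) - d
 * per element.  One primitive call replaces three, and no intermediate
 * FloatArrays are allocated or garbage collected. */
static void fusedAddMulSub(float *r, const float *a, const float *b,
                           const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = (a[i] + b[i]) * c[i] - d[i];
}
```

For 3-element vectors the saving is mostly the avoided call and allocation overhead rather than the arithmetic itself.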

The following experiment might be helpful:
a := Vector3Array new: 1000.
b := Vector3 new.
[1000 timesRepeat: [a + a]] timeToRun. "18ms"
[1000 timesRepeat: [a += a]] timeToRun. "5ms"
[1000000 timesRepeat: [b + b]] timeToRun. "481ms"
[1000000 timesRepeat: [b += b]] timeToRun. "258ms"

In all cases we're adding a million pairs of vect3s.  The first and  
third are slower than the second and fourth due to the allocation of  
a new target array each time; this gives you an idea of the  
achievable gains if you can restructure your algorithms to re-use an  
intermediate target array.  The first two are faster than the last  
two because they're not doing as much work in Squeak: fewer iterations  
and fewer primitive calls.  Note that these last two still don't  
involve unpacking and packing arrays.  Your code seems to be doing  
something like:

[1000 timesRepeat: [a do: [:aa | aa + aa]]] timeToRun.  "1346ms"
or perhaps
[1000 timesRepeat: [1 to: 1000 do: [:i | (a at: i) + (a at: i)]]] timeToRun. "1286ms"
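The same allocate-versus-reuse effect Josh measured exists outside Squeak too; a rough C analogue (hypothetical helper names) of why the in-place variants win:

```c
#include <stdlib.h>

/* Allocating variant: like `a + a`, every call creates a fresh result
 * array that must later be freed (in Squeak, garbage collected). */
static float *addAlloc(const float *a, const float *b, size_t n)
{
    float *r = malloc(n * sizeof *r);
    for (size_t i = 0; i < n; i++)
        r[i] = a[i] + b[i];
    return r;
}

/* In-place variant: like `a += a`, reuses the receiver as the target,
 * so the loop is the only cost. */
static void addInPlace(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```

In Squeak the gap is wider still, because each discarded result also adds incremental-GC work (note the 7133 incremental GCs in the tally).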

In short, it looks like there is a lot of room for improvement.

Josh



Re: Floating point performance

news.gmane.org-2
In reply to this post by David Faught
David Faught <dave.faught <at> gmail.com> writes:


> The MessageTally output is below.  Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot.  What vector processing hardware?  

If you link against the Intel MKL (Math Kernel Library), at least on Intel
chips, I believe you will take advantage of any special instructions. MKL
implements the same API as LAPACK and BLAS.
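For reference, a typical BLAS Level 1 routine is saxpy, y ← αx + y. MKL (like OpenBLAS or Apple's Accelerate) exports it as `cblas_saxpy` with SIMD-optimized code behind it; the plain loop below is just a stand-in showing the semantics a plugin would call out to:

```c
/* Reference version of the BLAS saxpy operation: y <- alpha*x + y.
 * A real plugin would instead call cblas_saxpy(n, alpha, x, 1, y, 1)
 * from the vendor library; this loop only demonstrates the contract. */
static void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

The incX/incY stride arguments of the real `cblas_saxpy` are omitted here (assumed 1, i.e. contiguous arrays).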

> The
> only thing I know of would be trying to use the video card GPU, which
> could be lots of fun!

BTW, has anybody ported Squeak to a GPU? Have you seen how Folding@home is
leveraging GPUs?

http://folding.stanford.edu/FAQ-ATI.html