John M McIntosh wrote:
> Could you share your messageTally. If you are using FloatArray logic
> then most of the math is done in the plugin. However the plugin does
> not take advantage of any vector processing hardware you might have,
> so there is room for improvement.

The MessageTally output is below. Maybe "almost 80% of the time was
spent in basic floating point array operations" is a little
exaggerated, but not a lot. What vector processing hardware? The only
thing I know of would be trying to use the video card GPU, which could
be lots of fun!

> Also if you have, say, a+b*c-d in Smalltalk where these are float
> array objects, that would be three primitive interactions; converting
> that to Slang would provide some performance improvements.

I'm not sure I understand this statement. Is there enough overhead in
the plugin API to justify eliminating a couple of calls, or is there
some data representation conversion involved that could be avoided?

I haven't read Andrew Greenberg's chapter on "Extending the Squeak
Virtual Machine" in detail yet. I kind of skimmed over the sections
"The Shape of a Smalltalk Object" and "The Anatomy of a Named
Primitive", which I'm sure is where all the good stuff is. Are you
saying that some performance improvement in your sample expression
could be gained by just coding it in Slang, without translating and
compiling it, or have I gone one step too far?

- 2441 tallies, 39083 msec.
**Tree**
100.0% {39083ms} TClothOxe>>pulse
77.8% {30407ms} TClothOxe>>constrain
|77.8% {30407ms} TClothOxe>>constrain:
| 14.2% {5550ms} B3DVector3(FloatArray)>>*
| 13.9% {5433ms} B3DVector3(FloatArray)>>-
| 12.2% {4768ms} B3DVector3Array>>at:
| 9.7% {3791ms} TClothOxe>>collide
| |9.7% {3791ms} TClothOxe>>collideSphere:
| | 3.6% {1407ms} B3DVector3(FloatArray)>>length
| | 3.0% {1172ms} B3DVector3(FloatArray)>>-
| | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
| | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
| 8.8% {3439ms} B3DVector3(FloatArray)>>+
| 6.3% {2462ms} B3DVector3Array>>at:put:
| 5.8% {2267ms} TClothOxe>>constrainGround
| |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
| |2.6% {1016ms} B3DVector3>>y
| 3.8% {1485ms} OrderedCollection>>do:
| 2.8% {1094ms} primitives
7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
|7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
| 2.7% {1055ms} B3DVector3Array>>at:put:
| 2.5% {977ms} B3DVector3Array>>at:
4.4% {1720ms} Float>>*
|2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
|2.0% {782ms} primitives
3.2% {1251ms} B3DVector3(FloatArray)>>-
2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:

**Leaves**
20.1% {7856ms} B3DVector3(FloatArray)>>-
19.8% {7738ms} B3DVector3Array>>at:
15.9% {6214ms} B3DVector3(FloatArray)>>*
11.8% {4612ms} B3DVector3Array>>at:put:
10.9% {4260ms} B3DVector3(FloatArray)>>+
3.8% {1485ms} OrderedCollection>>do:
2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.8% {1094ms} TClothOxe>>constrain:
2.6% {1016ms} B3DVector3>>y
2.0% {782ms} Float>>*

**Memory**
old +386,532 bytes
young -551,924 bytes
used -165,392 bytes
free +165,392 bytes

**GCs**
full 0 totalling 0ms (0.0% uptime)
incr 7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
tenures 1 (avg 7133 GCs/tenure)
root table 0 overflows
On Dec 13, 2006, at 4:33 PM, David Faught wrote:

> John M McIntosh wrote:
>> Could you share your messageTally. If you are using FloatArray logic
>> then most of the math is done in the plugin. However the plugin does
>> not take advantage of any vector processing hardware you might have,
>> so there is room for improvement.
>
> The MessageTally output is below. Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot. What vector processing hardware?

Like SSE or MMX on Intel, or Altivec on PowerPC.

> The only thing I know of would be trying to use the video card GPU,
> which could be lots of fun!
>
>> Also if you have, say, a+b*c-d in Smalltalk where these are float
>> array objects, that would be three primitive interactions; converting
>> that to Slang would provide some performance improvements.
>
> I'm not sure I understand this statement. Is there enough overhead in
> the plugin API to justify eliminating a couple of calls, or is there
> some data representation conversion involved that could be avoided?
>
> I haven't read Andrew Greenberg's chapter on "Extending the Squeak
> Virtual Machine" in detail yet. I kind of skimmed over the sections
> "The Shape of a Smalltalk Object" and "The Anatomy of a Named
> Primitive", which I'm sure is where all the good stuff is. Are you
> saying that some performance improvement in your sample expression
> could be gained by just coding it in Slang, without translating and
> compiling it, or have I gone one step too far?

I see about 43% on float array arithmetic and another 4% on regular
float arithmetic. The float array arithmetic is being done on very
short arrays (3 elements), so the call overhead might be significant
(i.e. John's suggestion re: a+b*c-d might pay dividends); if you're
dealing with 10000-element float arrays, then the call overhead is
negligible.
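[Editorial aside: John's a+b*c-d suggestion can be made concrete with a small C sketch. This is illustrative only, not the actual FloatArray plugin code: evaluating the expression as three separate primitive calls means three full passes over the data, each producing a freshly allocated temporary array, whereas a single hand-written primitive can fuse the work into one pass with no temporaries. The function names here are invented for the example.]

```c
#include <stddef.h>

/* Three separate passes, as three primitive calls would do:
   each one traverses its inputs and fills a temporary result. */
void vec_mul(const float *b, const float *c, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = b[i] * c[i];
}
void vec_add(const float *a, const float *x, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] + x[i];
}
void vec_sub(const float *x, const float *d, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = x[i] - d[i];
}

/* One fused pass: a + b*c - d with a single traversal and no
   temporaries -- roughly what a dedicated Slang/C primitive could do. */
void vec_fused(const float *a, const float *b, const float *c,
               const float *d, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i] * c[i] - d[i];
}
```

On 3-element vectors the per-call overhead dominates, so collapsing three calls into one matters much more than it would for long arrays.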
The big problem seems to be that you're spending a lot of time
unpacking and repacking B3DVector3Arrays. If you write a primitive to
do the computation, then you can avoid all of the #at: and #at:put:
overhead, the overhead of allocating and garbage collecting all of the
intermediate B3DVectors, and much of the call overhead for *, +, and -.

The following experiment might be helpful:

a := Vector3Array new: 1000.
b := Vector3 new.
[1000 timesRepeat: [a + a]] timeToRun.      "18ms"
[1000 timesRepeat: [a += a]] timeToRun.     "5ms"
[1000000 timesRepeat: [b + b]] timeToRun.   "481ms"
[1000000 timesRepeat: [b += b]] timeToRun.  "258ms"

In all cases we're adding a million pairs of vector3s. The first and
third are slower than the second and fourth due to the allocation of a
new target array each time; this gives you an idea of the achievable
gains if you can restructure your algorithms to reuse an intermediate
target array. The first two are faster than the last two because
they're not doing as much work in Squeak: fewer iterations and fewer
primitive calls. Note that these last two still don't involve
unpacking and packing arrays.

Your code seems to be doing something like:

[1000 timesRepeat: [a do: [:aa | aa + aa]]] timeToRun.  "1346ms"

or perhaps

[1000 timesRepeat: [1 to: 1000 do: [:i | (a at: i) + (a at: i)]]] timeToRun.  "1286ms"

In short, it looks like there is a lot of room for improvement.

Josh

> - 2441 tallies, 39083 msec.
> **Tree**
> 100.0% {39083ms} TClothOxe>>pulse
> 77.8% {30407ms} TClothOxe>>constrain
> |77.8% {30407ms} TClothOxe>>constrain:
> | 14.2% {5550ms} B3DVector3(FloatArray)>>*
> | 13.9% {5433ms} B3DVector3(FloatArray)>>-
> | 12.2% {4768ms} B3DVector3Array>>at:
> | 9.7% {3791ms} TClothOxe>>collide
> | |9.7% {3791ms} TClothOxe>>collideSphere:
> | | 3.6% {1407ms} B3DVector3(FloatArray)>>length
> | | 3.0% {1172ms} B3DVector3(FloatArray)>>-
> | | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
> | | 2.9% {1133ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> | 8.8% {3439ms} B3DVector3(FloatArray)>>+
> | 6.3% {2462ms} B3DVector3Array>>at:put:
> | 5.8% {2267ms} TClothOxe>>constrainGround
> | |3.2% {1251ms} B3DVector3Array(B3DInplaceArray)>>do:
> | |2.6% {1016ms} B3DVector3>>y
> | 3.8% {1485ms} OrderedCollection>>do:
> | 2.8% {1094ms} primitives
> 7.0% {2736ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
> |7.0% {2736ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
> | 2.7% {1055ms} B3DVector3Array>>at:put:
> | 2.5% {977ms} B3DVector3Array>>at:
> 4.4% {1720ms} Float>>*
> |2.4% {938ms} B3DVector3(Object)>>adaptToFloat:andSend:
> |2.0% {782ms} primitives
> 3.2% {1251ms} B3DVector3(FloatArray)>>-
> 2.3% {899ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
> 2.3% {899ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
>
> **Leaves**
> 20.1% {7856ms} B3DVector3(FloatArray)>>-
> 19.8% {7738ms} B3DVector3Array>>at:
> 15.9% {6214ms} B3DVector3(FloatArray)>>*
> 11.8% {4612ms} B3DVector3Array>>at:put:
> 10.9% {4260ms} B3DVector3(FloatArray)>>+
> 3.8% {1485ms} OrderedCollection>>do:
> 2.8% {1094ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
> 2.8% {1094ms} TClothOxe>>constrain:
> 2.6% {1016ms} B3DVector3>>y
> 2.0% {782ms} Float>>*
>
> **Memory**
> old +386,532 bytes
> young -551,924 bytes
> used -165,392 bytes
> free +165,392 bytes
>
> **GCs**
> full 0 totalling 0ms (0.0% uptime)
> incr 7133 totalling 1,326ms (3.0% uptime), avg 0.0ms
> tenures 1 (avg 7133 GCs/tenure)
> root table 0 overflows
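[Editorial aside: Josh's point about unpacking and repacking can also be sketched in C. This is an illustrative guess at what a dedicated primitive could look like, not actual Squeak plugin code, and the packed layout assumed here (consecutive x,y,z float triples) is an assumption for the example. Operating on the raw floats directly eliminates the per-element #at:/#at:put: sends and the allocation of intermediate B3DVector3 objects entirely.]

```c
#include <stddef.h>

/* Hypothetical primitive body: add one packed array of 3-element
   vectors into another, in place.  nvecs vectors = 3*nvecs floats.
   No boxing, no temporaries, one linear pass over the data. */
void vec3array_add_inplace(float *dst, const float *src, size_t nvecs)
{
    size_t n = 3 * nvecs;          /* total floats in the packed array */
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```

This is the same shape as Josh's fast `a += a` case: the whole array operation becomes one primitive call, and the in-place update avoids allocating a new target array on every call.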
David Faught <dave.faught <at> gmail.com> writes:
> The MessageTally output is below. Maybe "almost 80% of the time was
> spent in basic floating point array operations" is a little
> exaggerated, but not a lot. What vector processing hardware?

If you link up to the Intel MKL (Math Kernel Library), at least for
Intel chips, I believe you will take advantage of any special
instructions. MKL has the same API as LAPACK or BLAS.

> The only thing I know of would be trying to use the video card GPU,
> which could be lots of fun!

BTW has anybody ported Squeak to a GPU? Have you seen how folding@home
is leveraging GPUs?

http://folding.stanford.edu/FAQ-ATI.html
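[Editorial aside: for readers unfamiliar with the BLAS API mentioned above, the canonical Level-1 operation is "saxpy", y := alpha*x + y. The plain-C reference loop below shows what the routine computes; an MKL or other BLAS build provides the same operation (as `cblas_saxpy`) with a hand-vectorized SSE/Altivec inner loop. The code here is a reference sketch, not MKL source.]

```c
#include <stddef.h>

/* Reference implementation of the BLAS Level-1 "saxpy" operation,
   y := alpha*x + y, over n single-precision floats.  A BLAS library
   such as MKL supplies an optimized, SIMD-vectorized version of
   exactly this loop. */
void saxpy(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```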