After some discussion on and off the list, I tried a few-line rewrite
of the main time-consuming method, for the purpose of avoiding the
creation of unneeded intermediate result objects. Here is the method
before and after the rewrite:

----before----
constrain: its
	| v1 v2 dv r2 |
	self constrainGround.
	1 to: its do: [:iter|
		connections do: [:con|
			v1 _ positions at: (con node1).
			v2 _ positions at: (con node2).
			dv _ v2 - v1.
			r2 _ con restLength * con restLength.
			"fast square-root approximation:"
			dv _ dv * (r2 / ((dv dot: dv) + r2) - 0.5).
			positions at: (con node1) put: v1 - dv.
			positions at: (con node2) put: v2 + dv.
		].
	].
	self collide.
	self doHoldCorners.

----after----
constrain2: its
	| v1 v2 dv r2 |
	self constrainGround.
	1 to: its do: [:iter|
		connections do: [:con|
			v1 _ positions at: (con node1).
			v2 _ positions at: (con node2).
			dv _ v2 clone.
			dv -= v1.
			r2 _ con restLength * con restLength.
			"fast square-root approximation:"
			dv *= (r2 / ((dv dot: dv) + r2) - 0.5).
			v1 -= dv.
			v2 += dv.
			positions at: (con node1) put: v1.
			positions at: (con node2) put: v2.
		].
	].
	self collide.
	self doHoldCorners.
----

I was expecting good things as a result of this, but was rather
disappointed. The before and after tally results are below. They show
that the B3DVector3(FloatArray) *, -, and + operation times went away
(as expected) with pretty big increases in the primitives (this is
just shifted from the original operations) and B3DVector3Array>>at:put:
times, which was not expected. What happened, especially with the
at:put: times? I could see a shift like this in the percentages, but
the actual measured times went way up too, with the overall total time
being not very much less for the "optimized" version. Any ideas?

----before----
 - 3213 tallies, 51993 msec.
**Tree**
100.0% {51993ms} TClothOxe>>pulse
79.8% {41490ms} TClothOxe>>constrain
|79.8% {41490ms} TClothOxe>>constrain:
| 16.9% {8787ms} B3DVector3(FloatArray)>>*
| 14.6% {7591ms} B3DVector3(FloatArray)>>-
| 12.2% {6343ms} B3DVector3Array>>at:
| 10.2% {5303ms} TClothOxe>>collide
| |10.1% {5251ms} TClothOxe>>collideSphere:
| | 4.0% {2080ms} B3DVector3(FloatArray)>>length
| | |2.2% {1144ms} primitives
| | 3.3% {1716ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
| | |3.3% {1716ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
| | 2.7% {1404ms} B3DVector3(FloatArray)>>-
| 7.7% {4003ms} B3DVector3(FloatArray)>>+
| 5.8% {3016ms} TClothOxe>>constrainGround
| |3.2% {1664ms} B3DVector3>>y
| |2.6% {1352ms} B3DVector3Array(B3DInplaceArray)>>do:
| 5.0% {2600ms} B3DVector3Array>>at:put:
| 4.5% {2340ms} primitives
| 2.4% {1248ms} OrderedCollection>>do:
6.9% {3588ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
|6.9% {3588ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
| 2.9% {1508ms} B3DVector3Array>>at:
| 2.6% {1352ms} B3DVector3Array>>at:put:
3.7% {1924ms} Float>>*
|2.4% {1248ms} B3DVector3(Object)>>adaptToFloat:andSend:
2.7% {1404ms} B3DVector3(FloatArray)>>-
2.4% {1248ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
2.4% {1248ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:

**Leaves**
20.4% {10607ms} B3DVector3Array>>at:
20.0% {10399ms} B3DVector3(FloatArray)>>-
18.5% {9619ms} B3DVector3(FloatArray)>>*
9.6% {4991ms} B3DVector3Array>>at:put:
9.4% {4887ms} B3DVector3(FloatArray)>>+
4.5% {2340ms} TClothOxe>>constrain:
3.2% {1664ms} B3DVector3>>y
3.1% {1612ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.4% {1248ms} OrderedCollection>>do:
2.2% {1144ms} B3DVector3(FloatArray)>>length

**Memory**
old   +0 bytes
young +30,264 bytes
used  +30,264 bytes
free  -30,264 bytes

**GCs**
full       0 totalling 0ms (0.0% uptime)
incr       7139 totalling 13,565ms (26.0% uptime), avg 2.0ms
tenures    0
root table 0 overflows
----after----
 - 2799 tallies, 45160 msec.

**Tree**
100.0% {45160ms} TClothOxe>>pulse
75.8% {34231ms} TClothOxe>>constrain
|75.8% {34231ms} TClothOxe>>constrain2:
| 24.1% {10884ms} B3DVector3Array>>at:put:
| 15.6% {7045ms} B3DVector3Array>>at:
| 14.2% {6413ms} primitives
| 11.6% {5239ms} TClothOxe>>collide
| |11.6% {5239ms} TClothOxe>>collideSphere:
| | 4.3% {1942ms} B3DVector3(FloatArray)>>length
| | |2.2% {994ms} primitives
| | |2.1% {948ms} B3DVector3(FloatArray)>>squaredLength
| | 3.8% {1716ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
| | |3.8% {1716ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
| | | 2.1% {948ms} primitives
| | 3.3% {1490ms} B3DVector3(FloatArray)>>-
| 6.0% {2710ms} TClothOxe>>constrainGround
| |3.6% {1626ms} B3DVector3Array(B3DInplaceArray)>>do:
| |2.4% {1084ms} B3DVector3>>y
| 3.6% {1626ms} OrderedCollection>>do:
7.0% {3161ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
|7.0% {3161ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
| 2.5% {1129ms} B3DVector3Array>>at:
| 2.5% {1129ms} B3DVector3Array>>at:put:
5.0% {2258ms} Float>>*
|2.9% {1310ms} B3DVector3(Object)>>adaptToFloat:andSend:
| |2.5% {1129ms} B3DVector3(FloatArray)>>adaptToNumber:andSend:
| | 2.1% {948ms} B3DVector3(FloatArray)>>*
|2.1% {948ms} primitives
3.1% {1400ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
|3.1% {1400ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.8% {1264ms} B3DVector3(FloatArray)>>+
2.7% {1219ms} B3DVector3(FloatArray)>>-
2.1% {948ms} B3DVector3Array>>at:

**Leaves**
29.3% {13232ms} B3DVector3Array>>at:put:
25.1% {11335ms} B3DVector3Array>>at:
14.2% {6413ms} TClothOxe>>constrain2:
6.0% {2710ms} B3DVector3(FloatArray)>>-
3.6% {1626ms} OrderedCollection>>do:
3.5% {1581ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
3.0% {1355ms} B3DVector3(FloatArray)>>+
2.4% {1084ms} B3DVector3>>y
2.2% {994ms} B3DVector3(FloatArray)>>length
2.2% {994ms} B3DVector3(FloatArray)>>*
2.1% {948ms} B3DVector3(FloatArray)>>squaredLength
2.1% {948ms} Float>>*

**Memory**
old   +0 bytes
young -92,828 bytes
used  -92,828 bytes
free  +92,828 bytes

**GCs**
full       0 totalling 0ms (0.0% uptime)
incr       5804 totalling 11,186ms (25.0% uptime), avg 2.0ms
tenures    0
root table 0 overflows
----
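The difference between the two versions comes down to where vectors get allocated. A rough sketch in Python (not Smalltalk; the names and three-element-list "vectors" are only illustrative stand-ins for B3DVector3):

```python
def constrain_step(v1, v2, rest_length):
    # out-of-place, like the original constrain:. Each arithmetic step
    # below builds a fresh list, mimicking the fresh intermediate
    # B3DVector3 results of -, *, and +.
    dv = [b - a for a, b in zip(v1, v2)]
    r2 = rest_length * rest_length
    f = r2 / (sum(c * c for c in dv) + r2) - 0.5   # fast square-root approximation
    dv = [c * f for c in dv]
    return ([a - c for a, c in zip(v1, dv)],
            [b + c for b, c in zip(v2, dv)])

def constrain_step_inplace(v1, v2, rest_length, dv):
    # in-place, like constrain2:. One scratch vector dv is overwritten
    # and the inputs are mutated, mirroring clone / -= / *= / +=.
    for i in range(3):
        dv[i] = v2[i] - v1[i]
    r2 = rest_length * rest_length
    f = r2 / (sum(c * c for c in dv) + r2) - 0.5
    for i in range(3):
        dv[i] *= f
        v1[i] -= dv[i]
        v2[i] += dv[i]
```

Both produce the same numbers; the second just recycles storage per step, which is what the young-space figures above reflect (+30,264 bytes before, -92,828 bytes after).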
David Faught wrote:
> I was expecting good things as a result of this, but was rather
> disappointed. The before and after tally results are below. They
> show that the B3DVector3(FloatArray) *, -, and + operation times went
> away (as expected) with pretty big increases in the primitives (this
> is just shifted from the original operations) and
> B3DVector3Array>>at:put: times, which was not expected. What
> happened, especially with the at:put: times?

Two possibilities: First, it may be a sampling error, since you are
only using 16 msec intervals, which is extremely coarse and should not
be used for such a quantitative comparison. Get down to 1 msec instead.
Second, I have occasionally seen primitive tallies assigned to other
than their containing methods. I'm not sure why this happens, but the
only way to find out for sure is to convert all of the primitives into
"real sends" (e.g., put in extra methods which call the primitives).

> I could see a shift like this in the percentages, but the actual
> measured times went way up too, with the overall total time being not
> very much less for the "optimized" version. Any ideas?

Overall, the difference isn't too surprising. Assuming this was running
with the same parameters, you get some six seconds of overall speedup,
which is roughly a 12% speedup. That's not bad, and you should continue
along those lines (like actually measuring the "fast square root",
since I'm not convinced that it's either correct or faster than a
straightforward sqrt).

Cheers,
  - Andreas
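The "fast square root" can in fact be checked numerically. Writing d2 = dv dot: dv and r for the rest length, the exact Jakobsen-style correction factor is 0.5*(r/sqrt(d2) - 1), while the post's sqrt-free version is r2/(d2 + r2) - 0.5 = 0.5*(r2 - d2)/(d2 + r2); the two agree exactly at rest length and to first order nearby. A small Python check (illustrative only, not from the thread):

```python
import math

def exact_factor(d2, r):
    # exact correction: move each endpoint by 0.5 * (restLength - len) / len
    d = math.sqrt(d2)
    return 0.5 * (r / d - 1.0)

def approx_factor(d2, r):
    # the post's sqrt-free version: r2 / (d2 + r2) - 0.5
    r2 = r * r
    return r2 / (d2 + r2) - 0.5

# both vanish exactly at rest length (d2 == r*r) and stay close nearby
for d2 in (0.81, 1.0, 1.21):
    print(d2, exact_factor(d2, 1.0), approx_factor(d2, 1.0))
```

So the approximation is correct in the sense Andreas questions, at least for links near their rest length; whether it actually beats a primitive sqrt still needs measuring.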
David Faught <dave.faught <at> gmail.com> writes:
> > After some discussion on and off the list, I tried a few line rewrite
> of the main time consuming method for the purpose of avoiding the
> creation of unneeded intermediate result objects. Here is the method
> before and after the rewrite:

Hi David

Very interesting post. We have just finished a commercial project where
we took a Matlab multi-body dynamics system and converted it into VSE
(Visual Smalltalk Enterprise). We are expecting to have to translate
the central algorithm into C/Fortran at some time, but we have
successfully postponed that optimisation for the time being.

Incidentally, Matlab suffers from the same boxing/unboxing problem that
Smalltalk does, since everything in Matlab is a matrix. When I
benchmarked Matlab a few years ago, it was an order of magnitude slower
than VSE for float ops (not matrices). It may be faster now that they
have a jitter.

Scanning your code, you have the same issues that we had. As has been
mentioned, every time you have an operation that produces an
intermediate result, you have an object allocation problem.

> connections do: [:con|
> 	v1 _ positions at: (con node1).
> 	v2 _ positions at: (con node2).
> 	dv _ v2 clone. dv -= v1.

Here you are cloning something in the loop. Try allocating the temp
outside the loop, and reuse it each iteration by filling it with zeros
at the start of the loop. We had a Fortran primitive to fill arrays
with a constant value.

> 	r2 _ con restLength * con restLength.
> 	"fast square-root aproximation:"
> 	dv *= (r2 / ((dv dot: dv) + r2) - 0.5).

If this fast square-root approximation is good, then move it into
Fortran/C as a primitive. I may be wrong, but the last time I glanced
at the *= method on Matrix, it was a primitive only when both arguments
are matrices. Otherwise it is implemented in Squeak.

> 	v1 -= dv. v2 += dv.
> 	positions at: (con node1) put: v1.
> 	positions at: (con node2) put: v2.
> ].
> ].
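The hoisting suggestion above can be sketched in Python (again only an analogy for the Smalltalk; `positions` is a list of mutable three-element lists and `connections` a list of (node1, node2, restLength) tuples, all hypothetical names):

```python
def relax(positions, connections, iterations):
    dv = [0.0, 0.0, 0.0]                  # scratch allocated once, outside all loops
    for _ in range(iterations):
        for n1, n2, rest in connections:
            v1, v2 = positions[n1], positions[n2]
            for i in range(3):
                dv[i] = v2[i] - v1[i]     # overwrite the scratch, no per-connection clone
            r2 = rest * rest
            f = r2 / (sum(c * c for c in dv) + r2) - 0.5
            for i in range(3):
                v1[i] -= dv[i] * f        # mutate the stored vectors directly, so no
                v2[i] += dv[i] * f        # at:put: of freshly fetched-and-boxed copies
```

Note that mutating the stored vectors in place would also make the trailing `positions at: ... put: ...` sends in constrain2: redundant, since v1 and v2 are the very objects already held by the array.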
After we did all these tweaks, we found that the only bottleneck was
unpacking the state vector before doing the 'real' calc, and re-packing
the state vector afterwards. We tried rewriting those in Fortran, but
it didn't help. The problem there was the overhead in setting up a DLL
call.

The other thing you have to look out for is whether Squeak does any
marshalling of data before or after a primitive call. VSE copied
arguments into a buffer before a DLL call, and then copied them back
afterwards, which slows things down when you have a 1000*1000 matrix!
We solved that problem by creating a special matrix class that
allocated space on the C heap, and only passed the heap address across.

To summarize: linear calcs are fine in Smalltalk. Nonlinear stuff has
to go into Fortran (eventually). at: and at:put: are a pain!

I hope that helps

Cheers

Daniel
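The pass-the-address trick can be illustrated in Python with ctypes (a loose analogy for the VSE matrix class; the "primitive" here is an ordinary function standing in for a foreign call):

```python
import ctypes

n = 4
buf = (ctypes.c_double * n)(*range(n))   # one flat buffer, allocated up front
addr = ctypes.addressof(buf)             # only this address crosses the call boundary

def scale_inplace(address, count, factor):
    # stands in for a foreign "primitive": it works through the raw
    # address, so nothing is marshalled in or out around the call
    view = (ctypes.c_double * count).from_address(address)
    for i in range(count):
        view[i] *= factor

scale_inplace(addr, n, 2.0)
print(list(buf))   # → [0.0, 2.0, 4.0, 6.0]: the buffer itself was mutated
```

With a real 1000*1000 matrix the saving is exactly the two bulk copies that a marshalling FFI would otherwise do on every call.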
daniel poon <mr.d.poon <at> gmail.com> writes:
> After we did all these tweaks, we found that the only bottleneck was
> unpacking the state vector before doing the 'real' calc, and
> re-packing the state vector afterwards. We tried rewriting those in
> Fortran, but it didn't help. The problem there was the overhead in
> setting up a DLL call.

I forgot to mention that all our calculations were performed as a
callback from an ODE (ordinary differential equation) solver written in
Fortran, and the mechanics of the callback in VSE was a very, very
significant overhead. Hence my thread elsewhere about callbacks in
Squeak.

Cheers

Daniel