Floating point performance again

Floating point performance again

David Faught
After some discussion on and off the list, I tried a few-line rewrite
of the main time-consuming method to avoid creating unneeded
intermediate result objects.  Here is the method before and after the
rewrite:

----before----
constrain: its


        | v1 v2 dv r2 |

        self constrainGround.

        1 to: its do: [:iter|
                connections do: [:con|
                        v1 _ positions at: (con node1).
                        v2 _ positions at: (con node2).
                        dv _ v2 - v1.
                        r2 _ con restLength * con restLength.
                        "fast square-root approximation:"
                        dv _ dv * (r2 / ((dv dot: dv) + r2) - 0.5).
                        positions at: (con node1) put: v1 - dv.
                        positions at: (con node2) put: v2 + dv.
                        ].
                ].

        self collide.
        self doHoldCorners.

----after----
constrain2: its


        | v1 v2 dv r2 |

        self constrainGround.

        1 to: its do: [:iter|
                connections do: [:con|
                        v1 _ positions at: (con node1).
                        v2 _ positions at: (con node2).
                        dv _ v2 clone. dv -= v1.
                        r2 _ con restLength * con restLength.
                        "fast square-root approximation:"
                        dv *= (r2 / ((dv dot: dv) + r2) - 0.5).
                        v1 -= dv. v2 += dv.
                        positions at: (con node1) put: v1.
                        positions at: (con node2) put: v2.
                        ].
                ].

        self collide.
        self doHoldCorners.
----

I was expecting good things as a result of this, but was rather
disappointed.  The before and after tally results are below.  They
show that the B3DVector3(FloatArray) *, -, and + operation times went
away (as expected), but with pretty big increases in the primitives
times (just shifted from the original operations) and, unexpectedly,
in the B3DVector3Array>>at:put: times.  What happened, especially
with the at:put: times?

I could see a shift like this in the percentages, but the actual
measured times went way up too, with the overall total time being not
very much less for the "optimized" version.  Any ideas?

----before----
 - 3213 tallies, 51993 msec.

**Tree**
100.0% {51993ms} TClothOxe>>pulse
  79.8% {41490ms} TClothOxe>>constrain
    |79.8% {41490ms} TClothOxe>>constrain:
    |  16.9% {8787ms} B3DVector3(FloatArray)>>*
    |  14.6% {7591ms} B3DVector3(FloatArray)>>-
    |  12.2% {6343ms} B3DVector3Array>>at:
    |  10.2% {5303ms} TClothOxe>>collide
    |    |10.1% {5251ms} TClothOxe>>collideSphere:
    |    |  4.0% {2080ms} B3DVector3(FloatArray)>>length
    |    |    |2.2% {1144ms} primitives
    |    |  3.3% {1716ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    |    |    |3.3% {1716ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
    |    |  2.7% {1404ms} B3DVector3(FloatArray)>>-
    |  7.7% {4003ms} B3DVector3(FloatArray)>>+
    |  5.8% {3016ms} TClothOxe>>constrainGround
    |    |3.2% {1664ms} B3DVector3>>y
    |    |2.6% {1352ms} B3DVector3Array(B3DInplaceArray)>>do:
    |  5.0% {2600ms} B3DVector3Array>>at:put:
    |  4.5% {2340ms} primitives
    |  2.4% {1248ms} OrderedCollection>>do:
  6.9% {3588ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
    |6.9% {3588ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
    |  2.9% {1508ms} B3DVector3Array>>at:
    |  2.6% {1352ms} B3DVector3Array>>at:put:
  3.7% {1924ms} Float>>*
    |2.4% {1248ms} B3DVector3(Object)>>adaptToFloat:andSend:
  2.7% {1404ms} B3DVector3(FloatArray)>>-
  2.4% {1248ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    2.4% {1248ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:

**Leaves**
20.4% {10607ms} B3DVector3Array>>at:
20.0% {10399ms} B3DVector3(FloatArray)>>-
18.5% {9619ms} B3DVector3(FloatArray)>>*
9.6% {4991ms} B3DVector3Array>>at:put:
9.4% {4887ms} B3DVector3(FloatArray)>>+
4.5% {2340ms} TClothOxe>>constrain:
3.2% {1664ms} B3DVector3>>y
3.1% {1612ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
2.4% {1248ms} OrderedCollection>>do:
2.2% {1144ms} B3DVector3(FloatArray)>>length

**Memory**
        old +0 bytes
        young +30,264 bytes
        used +30,264 bytes
        free -30,264 bytes

**GCs**
        full 0 totalling 0ms (0.0% uptime)
        incr 7139 totalling 13,565ms (26.0% uptime), avg 2.0ms
        tenures 0
        root table 0 overflows

----after----
 - 2799 tallies, 45160 msec.

**Tree**
100.0% {45160ms} TClothOxe>>pulse
  75.8% {34231ms} TClothOxe>>constrain
    |75.8% {34231ms} TClothOxe>>constrain2:
    |  24.1% {10884ms} B3DVector3Array>>at:put:
    |  15.6% {7045ms} B3DVector3Array>>at:
    |  14.2% {6413ms} primitives
    |  11.6% {5239ms} TClothOxe>>collide
    |    |11.6% {5239ms} TClothOxe>>collideSphere:
    |    |  4.3% {1942ms} B3DVector3(FloatArray)>>length
    |    |    |2.2% {994ms} primitives
    |    |    |2.1% {948ms} B3DVector3(FloatArray)>>squaredLength
    |    |  3.8% {1716ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    |    |    |3.8% {1716ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
    |    |    |  2.1% {948ms} primitives
    |    |  3.3% {1490ms} B3DVector3(FloatArray)>>-
    |  6.0% {2710ms} TClothOxe>>constrainGround
    |    |3.6% {1626ms} B3DVector3Array(B3DInplaceArray)>>do:
    |    |2.4% {1084ms} B3DVector3>>y
    |  3.6% {1626ms} OrderedCollection>>do:
  7.0% {3161ms} B3DVector3Array(SequenceableCollection)>>replaceFrom:to:with:
    |7.0% {3161ms} B3DVector3Array(B3DInplaceArray)>>replaceFrom:to:with:startingAt:
    |  2.5% {1129ms} B3DVector3Array>>at:
    |  2.5% {1129ms} B3DVector3Array>>at:put:
  5.0% {2258ms} Float>>*
    |2.9% {1310ms} B3DVector3(Object)>>adaptToFloat:andSend:
    |  |2.5% {1129ms} B3DVector3(FloatArray)>>adaptToNumber:andSend:
    |  |  2.1% {948ms} B3DVector3(FloatArray)>>*
    |2.1% {948ms} primitives
  3.1% {1400ms} B3DVector3Array(SequenceableCollection)>>doWithIndex:
    |3.1% {1400ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
  2.8% {1264ms} B3DVector3(FloatArray)>>+
  2.7% {1219ms} B3DVector3(FloatArray)>>-
  2.1% {948ms} B3DVector3Array>>at:

**Leaves**
29.3% {13232ms} B3DVector3Array>>at:put:
25.1% {11335ms} B3DVector3Array>>at:
14.2% {6413ms} TClothOxe>>constrain2:
6.0% {2710ms} B3DVector3(FloatArray)>>-
3.6% {1626ms} OrderedCollection>>do:
3.5% {1581ms} B3DVector3Array(SequenceableCollection)>>withIndexDo:
3.0% {1355ms} B3DVector3(FloatArray)>>+
2.4% {1084ms} B3DVector3>>y
2.2% {994ms} B3DVector3(FloatArray)>>length
2.2% {994ms} B3DVector3(FloatArray)>>*
2.1% {948ms} B3DVector3(FloatArray)>>squaredLength
2.1% {948ms} Float>>*

**Memory**
        old +0 bytes
        young -92,828 bytes
        used -92,828 bytes
        free +92,828 bytes

**GCs**
        full 0 totalling 0ms (0.0% uptime)
        incr 5804 totalling 11,186ms (25.0% uptime), avg 2.0ms
        tenures 0
        root table 0 overflows

----


Re: Floating point performance again

Andreas.Raab
David Faught wrote:
> I was expecting good things as a result of this, but was rather
> disappointed.  The before and after tally results are below.  They
> show that the B3DVector3(FloatArray) *, -, and + operation times went
> away (as expected), but with pretty big increases in the primitives
> times (just shifted from the original operations) and, unexpectedly,
> in the B3DVector3Array>>at:put: times.  What happened, especially
> with the at:put: times?

Two possibilities: First, it may be a sampling error, since you are only
using 16 msec intervals, which is extremely coarse and should not be
used for such a quantitative comparison. Get down to 1 msec instead.
Second, I have occasionally seen primitive tallies assigned to methods
other than their containing ones. I'm not sure why this happens, but
the only way to find out for sure is to convert all of the primitives
into "real sends" (e.g., put in extra methods which call the primitives).
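
A minimal sketch of the 1 msec run, assuming your image's MessageTally
still understands spyEvery:on: (worth checking); "cloth" is just a
placeholder for whatever receives #pulse:

----sketch----
"sample every 1 msec instead of the default 16"
MessageTally new spyEvery: 1 on: [cloth pulse].
----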

> I could see a shift like this in the percentages, but the actual
> measured times went way up too, with the overall total time being not
> very much less for the "optimized" version.  Any ideas?

Overall, the difference isn't too surprising. Assuming this was running
with the same parameters, you get some six seconds of overall speedup,
which is roughly 12%. That's not bad, and you should continue along
those lines (like actually measuring the "fast square root", since I'm
not convinced that it's either correct or faster than a straightforward
sqrt).
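
A rough harness for that measurement, as a sketch (it assumes
B3DVector3 class>>x:y:z: is available; the operand values and the
iteration count are arbitrary):

----sketch----
| dv r2 n |
dv _ B3DVector3 x: 0.3 y: 0.4 z: 0.0.
r2 _ 0.25.
n _ 1000000.
"the approximation used in constrain:"
Transcript show: 'approx: ',
	[n timesRepeat: [r2 / ((dv dot: dv) + r2) - 0.5]] timeToRun printString,
	' ms'; cr.
"a straightforward square root of the same dot product"
Transcript show: 'sqrt:   ',
	[n timesRepeat: [(dv dot: dv) sqrt]] timeToRun printString,
	' ms'; cr.
----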


Cheers,
   - Andreas


Re: Floating point performance again

news.gmane.org-2
David Faught <dave.faught <at> gmail.com> writes:

>
> After some discussion on and off the list, I tried a few-line rewrite
> of the main time-consuming method to avoid creating unneeded
> intermediate result objects.  Here is the method before and after the
> rewrite:

Hi David

Very interesting post. We have just finished a commercial project where we took
a Matlab multi-body dynamics system and converted it into VSE (Visual Smalltalk
Enterprise). We are expecting to have to translate the central algorithm into
C/Fortran at some time, but we have successfully postponed that optimisation for
the time being.

Incidentally, Matlab suffers from the same boxing/unboxing problem that
Smalltalk does, since everything in Matlab is a matrix. When I
benchmarked Matlab a few years ago, it was an order of magnitude slower
than VSE for float ops (not matrices). It may be faster now that they
have a JIT.

Scanning your code, you have the same issues that we had. As has been
mentioned, every time an operation produces an intermediate result, you
have an object allocation problem.

> connections do: [:con|
> v1 _ positions at: (con node1).
> v2 _ positions at: (con node2).
> dv _ v2 clone. dv -= v1.

Here you are cloning something inside the loop. Try allocating the temp
outside the loop, and reuse it each iteration by filling it with zeros
at the start of the loop. We had a Fortran primitive to fill arrays
with a constant value.
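
Something like this, as a sketch (it assumes B3DVector3 new answers a
fresh vector, and it overwrites the temp with
replaceFrom:to:with:startingAt: rather than zero-filling, since the old
contents are fully replaced anyway):

----sketch----
constrain3: its
	"hypothetical variant of constrain2: with dv hoisted out of the loops"
	| v1 v2 dv r2 |
	self constrainGround.
	dv _ B3DVector3 new.	"allocated once, reused every iteration"
	1 to: its do: [:iter |
		connections do: [:con |
			v1 _ positions at: (con node1).
			v2 _ positions at: (con node2).
			dv replaceFrom: 1 to: 3 with: v2 startingAt: 1.	"instead of v2 clone"
			dv -= v1.
			r2 _ con restLength * con restLength.
			"fast square-root approximation:"
			dv *= (r2 / ((dv dot: dv) + r2) - 0.5).
			v1 -= dv. v2 += dv.
			positions at: (con node1) put: v1.
			positions at: (con node2) put: v2]].
	self collide.
	self doHoldCorners
----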

> r2 _ con restLength * con restLength.
> "fast square-root approximation:"
> dv *= (r2 / ((dv dot: dv) + r2) - 0.5).

If this fast square-root approximation is good, then move it into
Fortran/C as a primitive. I may be wrong, but the last time I glanced
at the *= method on Matrix, it was a primitive only when both arguments
are Matrices. Otherwise it is implemented in Squeak.
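
In Squeak the hook for that is the named-primitive pragma, with the
Smalltalk code kept as a fallback. The plugin and primitive names below
are made up, and the plugin itself would still have to be written (e.g.
with VMMaker):

----sketch----
constrainScale: r2
	"hypothetical: do the whole scale step in a plugin, falling back
	 to the in-image code when the plugin is missing"
	<primitive: 'primConstrainScale' module: 'ClothPlugin'>
	^self *= (r2 / ((self dot: self) + r2) - 0.5)
----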

> v1 -= dv. v2 += dv.
> positions at: (con node1) put: v1.
> positions at: (con node2) put: v2.
> ].
> ].

After we did all these tweaks, we found that the only bottleneck was
unpacking the state vector before doing the 'real' calc, and re-packing
the state vector afterwards. We tried rewriting those in Fortran, but
it didn't help. The problem there was the overhead of setting up a DLL
call.

The other thing you have to look out for is whether Squeak does any
marshalling of data before or after a primitive call. VSE copied
arguments into a buffer before a DLL call, and then copied them back
afterwards, which slows things down when you have a 1000x1000 matrix!
We solved that problem by creating a special matrix class that
allocated space on the C heap, and only passed the heap address across.
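
The same idea sketched in Squeak terms rather than VSE (HeapMatrix and
its accessors are hypothetical, and it assumes the FFI's
ExternalAddress with an allocate: method; check your image before
relying on it):

----sketch----
Object subclass: #HeapMatrix
	instanceVariableNames: 'address rows cols'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Hypothetical-FFI'.

"class side: grab r * c doubles (8 bytes each) on the C heap"
HeapMatrix class >> rows: r cols: c
	^self new setAddress: (ExternalAddress allocate: r * c * 8)
		rows: r cols: c

"only this pointer ever crosses the DLL boundary; no per-call copying"
HeapMatrix >> pointer
	^address
----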

To summarize: linear calcs are fine in Smalltalk. Nonlinear stuff has
to go into Fortran (eventually). at: and at:put: are a pain!

I hope that helps

Cheers

Daniel



Re: Floating point performance again

news.gmane.org-2
daniel poon <mr.d.poon <at> gmail.com> writes:

> After we did all these tweaks, we found that the only bottleneck was
> unpacking the state vector before doing the 'real' calc, and
> re-packing the state vector afterwards. We tried rewriting those in
> Fortran, but it didn't help. The problem there was the overhead of
> setting up a DLL call.

I forgot to mention that all our calculations were performed as
callbacks from an ODE (ordinary differential equation) solver written
in Fortran, and the mechanics of the callback in VSE were a very, very
significant overhead. Hence my thread elsewhere about callbacks in
Squeak.

Cheers

Daniel