Suggestion for a performance analysis project, suitable for a masters student

Suggestion for a performance analysis project, suitable for a masters student

Eliot Miranda-2
Hi All,

    I had occasion to compare VW (vw7.7nc) and Spur recently and was pleasantly surprised to see that Spur is on average about 40% faster than VW (a -40% change in running time) on a large subset of the benchmarks from the shootout (I didn't include three because of Regex syntax issues).  Now Spur gets some of its speed from having direct pointers vs VisualWorks' object header/table indirection, but it could get other speedups from various other differences.  It would be great to know exactly how much speedup comes from what, and indeed how much cost Spur pays for its lazy become:.  I'd like to propose a project to determine exactly the costs of an explicit read barrier and of lazy forwarding, compared to no check at all.

Spur, part of VMMaker.oscog, is implemented by a hierarchy of classes that implement a 32-bit and a 64-bit memory manager.  Spur is a sibling of the old ObjectMemory class, which implements the V3 object representation.  The current Spur does "lazy forwarding": a become: of two objects is implemented by cloning each object, turning the old versions into forwarders that point to the opposite copy, and relying on send-time checks to lazily follow forwarding pointers when a send to a forwarded object fails its message lookup.
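
In pseudocode the lazy scheme looks roughly like this (a sketch of the idea only; the selector names are illustrative, not the actual VMMaker.oscog code):

	becomeLazily: obj1 with: obj2
		| copy1 copy2 |
		copy1 := self clone: obj1.
		copy2 := self clone: obj2.
		self forward: obj1 to: copy2.	"obj1 is now a forwarder to the copy of obj2"
		self forward: obj2 to: copy1.	"obj2 is now a forwarder to the copy of obj1"
		"No heap scan here.  Stale references remain in the heap and are
		 caught later, at send time, when a send to a forwarder fails its
		 lookup and the references in play are followed and fixed up."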

The project would create two additional variations on Spur, both of which dispense with the lazy send-time check.  One would explicitly test for a forwarding pointer on every access; the other would never check, would not need send-time checking either, and would reimplement become: like the old ObjectMemory does, by scanning the entire heap to exchange all references.
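
Sketches of the two variations, again with illustrative selector names rather than the real VMMaker interface:

	"Variation 1: an explicit read barrier.  Every pointer fetch checks
	 for, and follows, forwarders, so no send-time machinery is needed."
	fetchPointer: fieldIndex ofObject: objOop
		| oop |
		oop := self rawFetchPointer: fieldIndex ofObject: objOop.
		^(self isForwarded: oop)
			ifTrue: [self followForwarded: oop]
			ifFalse: [oop]

	"Variation 2: no checks at all.  become: reverts to the V3
	 ObjectMemory approach of scanning the entire heap to exchange
	 references."
	becomeByScan: obj1 with: obj2
		self allHeapObjectsDo: [:obj |
			0 to: (self numPointerSlotsOf: obj) - 1 do: [:i |
				| field |
				field := self rawFetchPointer: i ofObject: obj.
				field = obj1
					ifTrue: [self storePointer: i ofObject: obj withValue: obj2]
					ifFalse: [field = obj2 ifTrue:
						[self storePointer: i ofObject: obj withValue: obj1]]]]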

These variations could be implemented as subclasses or siblings of Spur.  The project isn't trivial because it also means making changes to the JIT, albeit within its framework for multiple object representations.  Because this is probably some months' work, I can't do it myself, but I'm extremely interested in the results, and I think it would make a really good paper, e.g. for ISMM or one of the dynamic language workshops.

If what I've said makes any sense at all to any academics out there who are looking for a challenging but nicely contained and far from open-ended Masters project, then please get in touch and we can see if we can take this any further.

_,,,^..^,,,_
best, Eliot



Shootout Regex benchmarks (was: Re: [squeak-dev] Suggestion for a performance analysis project, suitable for a masters student)

Levente Uzonyi-2
Hi Eliot,

I made many of the Shootout tests work[1] (including those which rely on
Regex), and rewrote some of them to use Squeak-specific optimizations
instead of VW-specific ones.
I know the latter may be something you don't want to have, but the Regex
version in the Inbox[2] is (or at least it should be) compatible with the
Shootout benchmarks.

Evaluating

  ShootoutTests runAllToDummyStreamVs: ShootoutTests referenceTimesForVW

I get the following:

{[self binarytrees: 16 to: stream]}
  took 1.441 seconds
ratio: 0.597   % change: -40.331%

{[self chameneosredux: 600000 to: stream]}
  took 2.083 seconds
ratio: 0.42   % change: -58.038%

{[self fannkuchRedux: 10 to: stream]}
  took 3.362 seconds
ratio: 0.896   % change: -10.371%

{[self fasta: 2500000 to: stream]}
  took 5.794 seconds
ratio: 1.199   % change: 19.934%

{[self fastaRedux: 2500000 to: stream]}
  took 2.281 seconds
ratio: 0.627   % change: -37.318%

{[self mandelbrot3: 1000 to: stream]}
  took 1.26 seconds
ratio: 0.311   % change: -68.943%

{[self meteor: 2098 to: stream]}
  took 0.415 seconds
ratio: 0.769   % change: -23.148%

{[self nbody: 500000 to: stream]}
  took 1.076 seconds
ratio: 0.332   % change: -66.8%

{[self
  pidigitsTo: 2000
  width: 10
  to: stream]}
  took 0.859 seconds
ratio: 0.795   % change: -20.463%

{[self regexDNA: fasta50000 to: stream]}
  took 3.729 seconds
ratio: 0.533   % change: -46.668%

{[self reverseComplement: fasta2500000 to: stream]}
  took 0.177 seconds
ratio: 0.058   % change: -94.231%

{[self spectralnorm: 500]}
  took 0.594 seconds
ratio: 0.607   % change: -39.264%

{[self threadring: 5000000 to: stream]}
  took 1.655 seconds
ratio: 0.341   % change: -65.89%
geometric mean '0.478'   average speedup '-52.212'%

This comparison is unfair of course, because the reference numbers are
from a different machine and they use a dummy stream, but it's still
useful for making estimates.
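
For anyone reading the numbers: the ratio appears to be the Squeak time
divided by the VW reference time, and the % change appears to be
(ratio - 1) * 100, so negative means Squeak is faster.  A workspace check
against the binarytrees figures above (my reading of the output, not the
actual ShootoutTests code):

	| squeakTime vwReferenceTime ratio |
	squeakTime := 1.441.	"Spur time for binarytrees, in seconds"
	vwReferenceTime := 1.441 / 0.597.	"implied VW reference time, ~2.414 seconds"
	ratio := squeakTime / vwReferenceTime.	"0.597, as reported"
	(ratio - 1) * 100	"~ -40.3, the reported % change"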

Levente

[1] http://leves.web.elte.hu/squeak/Shootout-ul.18.mcz
[2] http://source.squeak.org/inbox/Regex-Core-ul.37.mcz

On Thu, 13 Aug 2015, Eliot Miranda wrote:

> [snip: original proposal quoted in full above]


Re: Shootout Regex benchmarks (was: Re: [squeak-dev] Suggestion for a performance analysis project, suitable for a masters student)

Eliot Miranda-2
Levente,

On Thu, Aug 13, 2015 at 1:35 PM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

I made many of the Shootout tests work[1] (including those which rely on Regex), and rewrote some of them to use Squeak-specific optimizations instead of VW-specific ones.
I know the latter may be something you don't want to have, but the Regex version in the Inbox[2] is (or at least it should be) compatible with the Shootout benchmarks.

Great!  Where were you yesterday? :-)  I spent yesterday getting my own version of these somewhat working, only to find that you've done the complete job.  Thank you!!

BTW, here's the data I get on a 2.3GHz Core i7 MacMini comparing the latest Spur against vw7.7

#( 46.25857966666666 "total time to run suite on 7.7nc"
26.88 "total time to run suite on Spur r3420"
1.7209293030753967 "ratio, 7.7 time / Spur time"
-41.89186050740468 "% speedup of Spur relative to 7.7"
"benchmark"  "ratio" "% change, (new - old) / old * 100"
#( #(#binarytrees 1.99 -49.77)
#(#chameneosredux2 3.52 -71.62)
#(#fannkuchredux 0.82 21.59)
#(#fastaredux 1.27 -21.22)
#(#mandelbrot2 2.6 -61.54)
#(#meteor 1.09 -8.26)
#(#pidigits4 1.4 -28.54)
#(#spectralnorm2 1.20 -16.76)))
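
The two derived numbers follow directly from the two totals; a workspace check (not part of the suite):

	| vwTotal spurTotal |
	vwTotal := 46.25857966666666.	"total time on 7.7nc"
	spurTotal := 26.88.	"total time on Spur r3420"
	vwTotal / spurTotal.	"1.7209... the ratio"
	(spurTotal - vwTotal) / vwTotal * 100	"-41.89... the % speedup"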

Good numbers :-)


[snip: Levente's benchmark output, quoted in full above]

This comparison is unfair of course, because the reference numbers are from a different machine and they use a dummy stream, but it's still useful for making estimates.

Yes.  Testing to the dummy stream is useful for profiling, where one wants to focus on the cost of the algorithm.  But having the framework be flexible is useful too, especially making it easy to compare against different baselines, e.g. VW, Cog V3, Interpreter V3, etc.
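
For instance, given a class-side reference-times method per baseline, a comparison against Cog V3 could read as below; referenceTimesForCogV3 is hypothetical, only referenceTimesForVW is known to exist:

	ShootoutTests runAllToDummyStreamVs: ShootoutTests referenceTimesForCogV3	"hypothetical selector"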
 

Levente

[1] http://leves.web.elte.hu/squeak/Shootout-ul.18.mcz
[2] http://source.squeak.org/inbox/Regex-Core-ul.37.mcz

Great, thanks!

On Thu, 13 Aug 2015, Eliot Miranda wrote:

[snip: original proposal quoted in full above]

--
_,,,^..^,,,_
best, Eliot