[croquet] [kragen@pobox.com: Smalltalk performance and Moore's Law]

Eugen Leitl

[croquet] [kragen@pobox.com: Smalltalk performance and Moore's Law]

----- Forwarded message from Kragen Javier Sitaker <[hidden email]> -----

From: Kragen Javier Sitaker <[hidden email]>
Date: Mon, 5 Mar 2007 03:37:02 -0500 (EST)
To: [hidden email]
Subject: Smalltalk performance and Moore's Law
User-Agent: Mutt/1.5.9i

Previous version posted at
http://lambda-the-ultimate.org/node/531#comment-23457 on 2006-12-25.

This is a partial rebuttal to Alan Kay's occasional assertion that
computers aren't nearly as much faster at executing late-bound things
like Smalltalk as you would expect from Moore's Law.

In an interview with ACM Queue, Kay says [7]:

Just as an aside, to give you an interesting benchmark --- on
roughly the same system, roughly optimized the same way, a
benchmark from 1979 at Xerox PARC runs only 50 times faster
today. Moore’s law has given us somewhere between 40,000 and
60,000 times improvement in that time. So there’s approximately a
factor of 1,000 in efficiency that has been lost by bad CPU
architectures.

But Moore's Law is about price-performance, not absolute performance;
here I estimate that the actual loss of price-performance attributable
to bad CPU architectures is perhaps a factor of 10 to 50, and it is
plausible that better compilers can remedy this.

Guesswork
=========

"Resuna" writes [6]:

The [VAX] 11/780 was 3.6 MHz, 32-bit words. I don't know how fast
the Alto or Dorado were, but with the Dorado being the
archetypical "3M" machine I assume its performance was comparable
to a nominally 1-MIPS 11/780.

According to Wikipedia [0], the Dorado was an all-ECL machine. The
abstract to Lampson and Pier's paper on the Dorado [1], which I
haven't read, says it ran at 20MHz, had 16 hardware threads to provide
zero-context task switching, and was built out of "approximately 3000
MSI [ECL] components". So it was considerably faster than a VAX.
Maybe one of the older D-machines is "the archetypal 3M-machine".

Apparently it could run 200k-400k Smalltalk bytecodes per second [2].
I'm guessing that the Dorado is the particular machine Kay was
alluding to benchmarking, since it was introduced in 1979, and the
context of the conversation is how machines designed to be efficient
at high-level language execution were worthwhile.

I don't think it was ever sold commercially (or even mass-produced
in-house), which makes per-unit costs difficult to calculate.
However, if we assume that the 3000 chips in the thing cost
$20 each (unfortunately I have no real idea how much ECL chips cost in
1980), that's a $60 000 bill-of-materials cost. So it might have cost
$100 000 per machine if it had been mass-produced, and since it was
ECL, the electrical power cost of running it would likely be higher
per chip as well.

According to the squeak-dev thread on the subject [3], modern 600MHz
uniprocessors are about 20x the speed of the Dorado when running
Squeak, or 35 million bytecodes per second (which sounds more like
100x the speed of the Dorado, actually).
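
The parenthetical above can be checked with a short calculation. This is
only a sketch using the figures quoted in this message (200k-400k
bytecodes per second for the Dorado [2], 35 million for the 600MHz
machine [3]); the function and constant names are made up for
illustration.

```python
# Figures quoted in the message; both are rough estimates, not measurements.
DORADO_BYTECODES_PER_SEC = (200_000, 400_000)  # low and high estimates [2]
MODERN_BYTECODES_PER_SEC = 35_000_000          # 600MHz uniprocessor [3]


def speedup_range(modern, dorado_range):
    """Return (min, max) speedup of the modern machine over the Dorado."""
    low, high = dorado_range
    # Dividing by the faster Dorado estimate gives the smaller speedup.
    return modern / high, modern / low


if __name__ == "__main__":
    smallest, largest = speedup_range(MODERN_BYTECODES_PER_SEC,
                                      DORADO_BYTECODES_PER_SEC)
    print(f"{smallest:.1f}x to {largest:.1f}x")  # 87.5x to 175.0x
```

So under these numbers the ratio is roughly 90x to 175x, which fits
"more like 100x" better than 20x.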

However, the uniprocessors in question cost US$150 or so, which is
inflation-equivalent to maybe US$75 in 1980 dollars. (They also
include hundreds of megabytes of RAM, instead of the 8MB on the
Dorado.)

If you were going to spend $100 000 today (or when Kay gave this
interview) on a computer to run Smalltalk on, you would probably get a
Beowulf of 50 nodes, each node of which could run bytecodes at 50 to
200 times the speed of a Dorado, and that's running Squeak, which is
not designed to be a particularly high-performance Smalltalk. But
Moore's Law has still given us, by my rough estimates, a factor of
2500 to 10 000 in price/performance in this case. (That's not
counting the difference between 8 megs of RAM and 50 000 megs of RAM,
or the advantage of having 10TB of disk, etc.) A factor of 2500 is
still noticeably less than the 131072x improvement that you might
predict from a naive application of Moore's law, but the remaining
factor of 10-50 is probably explicable in terms of Kay's explanation:
the architecture is not optimized for Smalltalk bytecode execution, so
you get a 10-50x slowdown when you use it as if it were a Dorado.

(You might be able to get a Beowulf of 300 nodes at that price,
depending on other circumstances.)
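
The arithmetic behind these estimates can be sketched as follows. It
takes the node counts, the per-node speedups, and the 131072x figure
straight from the text above, and it assumes (as the estimate itself
implicitly does) that aggregate throughput scales linearly with the
number of nodes. The names are hypothetical.

```python
# All inputs come from the message; linear scaling across nodes is an
# assumption of the estimate, not a measured property of any cluster.
PER_NODE_SPEEDUP = (50, 200)  # each node, relative to one Dorado
NAIVE_MOORE = 131072          # the "naive" figure in the text, equal to 2**17


def cluster_gain(nodes, per_node=PER_NODE_SPEEDUP):
    """Aggregate price/performance gain for a $100 000 cluster of `nodes`."""
    low, high = per_node
    return nodes * low, nodes * high


def shortfall(gain_range, naive=NAIVE_MOORE):
    """Factor by which the gain falls short of the naive prediction."""
    low, high = gain_range
    # The larger gain leaves the smaller shortfall.
    return naive / high, naive / low


if __name__ == "__main__":
    for nodes in (50, 300):
        gain = cluster_gain(nodes)
        small, large = shortfall(gain)
        print(f"{nodes} nodes: gain {gain[0]}x to {gain[1]}x, "
              f"shortfall {small:.1f}x to {large:.1f}x")
```

For 50 nodes this reproduces the 2500 to 10 000 range and a shortfall of
about 13x to 52x, i.e. the "factor of 10-50". For the 300-node case the
same arithmetic gives a gain of 15 000 to 60 000 and a shortfall below
10x.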

How much faster are other Smalltalk implementations than Squeak?
Various microbenchmarks seem to peg Strongtalk at 3x-10x faster than
Squeak (Avi Bryant's [4], David Griswold/Klaus Witzel's [5]), which
would nicely compensate for the remainder of Kay's complaint.

References
==========

[0] Wikipedia article "Xerox Alto", section "Diffusion and Evolution",
as of 2006-12-25
> http://en.wikipedia.org/wiki/Xerox_Alto#Diffusion_and_evolution

[1] "A Processor for a High-Performance Personal Computer", from
Butler W. Lampson and Kenneth A. Pier, Xerox PARC, 1980, IEEE
"CH1494-4/80/0000-0146" (whatever that means), 15 pp.; mentions, among
other things, that the first machine "came up in the spring of 1979".
> http://research.microsoft.com/Lampson/24-DoradoProcessor/Acrobat.pdf

[2] Squeak-dev post "Dorado bytecodes per second", from Bruce ONeel
(edoneel at sdf.lonestar.org), 2005-05-28T16:41:49 CEST, quoting
previous post from Jecel Assumpcao Jr (jecel at merlintec.com):

By running the benchmarks for the "green book" and doing a lot of rough
extrapolations, my guess is that the Dorado would get between 200K and
400K bytecodes/sec.

And followup from Tim Rowledge (tim at rowledge.org):

That is pretty much what I remember as the claim for Dorados.

> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-April/091211.html

[3] Squeak-dev post "Dorado bytecodes per second", from Jecel
Assumpcao Jr (jecel at merlintec.com), 2005-05-28T22:38:19 CEST ---
he's talking about 600MHz ARMs.
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-April/091215.html

[4] Blog post "Ruby and Strongtalk II", by Avi Bryant, on his blog
"HREF Considered Harmful"; the microbenchmark in question did a
billion accesses of a thousand-element array of small integers, took
0.7 seconds in Java, 7 seconds in Strongtalk, 70 seconds in Squeak, or
16 if you use Array instead of ByteArray.
> http://smallthought.com/avi/?p=17

[5] Squeak-dev thread "Thue-Morse and performance: Squeak
v.s. Strongtalk v.s. VisualWorks", started by Klaus D. Witzel
2006-12-17; several people, including David Griswold, point out flaws
in Witzel's initial benchmark, and the results are interesting.
> http://www.nabble.com/Thue-Morse-and-performance:-Squeak-v.s.-Strongtalk-v.s.-VisualWorks-t2834773.html

[6] Comment "I still want to see Kay's benchmark...", from "Resuna",
2005-07-22
> http://lambda-the-ultimate.org/node/531#comment-7895

[7] ACM Queue article "A Conversation with Alan Kay: Big Talk with the
creator of Smalltalk --- and much more.", by Stuart Feldman and Alan
Kay, vol. 2, no. 9, Dec/Jan 2004-2005, is the origin of this quote.
> http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=273&page=3

----- End forwarded message -----
--
Eugen* Leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

David P. Reed

Re: [croquet] [kragen@pobox.com: Smalltalk performance and Moore's Law]

The Dorado was probably not the machine alluded to in 1979. The
Dandelion (which was the Xerox STAR machine, and was commercially sold)
was a much more likely candidate, IMHO. There was also a mid-scale machine
(what we would now call "workstation class") in the D-system "family"
that was used largely for Common LISP development.

Not that it matters, but I take strong issue with the notion cited that
Moore's Law defined "price performance" laws. That may be how some refer
to it now, but Moore's Law refers to gate density, line width, and other
parameters that don't refer to price - merely achieved engineering
metrics vs. time. In fact, price arises in the silicon industry in a way
that is so dominated by volume of production that there is no reasonable
way to predict "price vs. performance" without knowing the market growth
path - something Gordon Moore did not attempt to predict.

Jecel Assumpcao Jr

Re: [croquet] [kragen@pobox.com: Smalltalk performance and Moore's Law]

In reply to this post by David P. Reed

David P. Reed wrote:
> The Dorado was probably not the machine alluded to in 1979. The
> Dandelion (which was the Xerox STAR machine, and was commercially sold)
> was a much more likely candidate, IMHO. There was a mid-scale (what we
> would now call a "workstation class" machine) in the D-system "family"
> that was used largely for Common LISP development, also.

Actually "Dorados" was a popular unit of performance in the Smalltalk
papers from the 1980s, so I would expect that this was the machine Alan
was thinking of. He can tell us whether this is correct, of course.

> Not that it matters, but I take strong issue with the notion cited that
> Moore's Law defined "price performance" laws. That may be how some refer
> to it now, but Moore's Law refers to gate density, line width, and other
> parameters that don't refer to price - merely achieved engineering
> metrics vs. time. In fact, price arises in the silicon industry in a way
> that is so dominated by volume of production that there is no reasonable
> way to predict "price vs. performance" without knowing the market growth
> path - something Gordon Moore did not attempt to predict.

The idea of following Moore's law in either the constant performance or
the constant price curves can be found in the 1978 "Computer
Engineering" by C. Gordon Bell, J. Craig Mudge and John E. McNamara.
This is a very interesting book and is available online:

http://research.microsoft.com/~gbell/Computer_Engineering/index.html

So if we look at the graph on page 35 (derived from the one on page 30
showing Moore's law)

http://research.microsoft.com/~gbell/Computer_Engineering/00000035.htm

we can start at 1 Dorado @ $100K in 1980 and follow the constant
performance curve to a 1 Dorado @ $781 machine in 1991, and then follow
the constant price curve to a 1024 Dorado @ $781 computer in 2006.
Current high end PCs run Squeak at 400 to 800 Dorados, which is close
enough.
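
The two steps along the curves can be sketched numerically. The figures
$100K, $781, 1024, and 400 to 800 Dorados are from the paragraph above;
the counts of 7 halvings and 10 doublings are not stated there but are
inferred here from those figures (100 000 / 781 is about 2**7, and 1024
is 2**10). The function names are made up for illustration.

```python
def constant_performance_price(start_price, halvings):
    """Price of the same performance after `halvings` price halvings."""
    return start_price / 2 ** halvings


def constant_price_performance(start_perf, doublings):
    """Performance at the same price after `doublings` doublings."""
    return start_perf * 2 ** doublings


if __name__ == "__main__":
    # 1980 -> 1991: constant performance (1 Dorado), falling price.
    price_1991 = constant_performance_price(100_000, 7)  # inferred: 7 halvings
    # 1991 -> 2006: constant price, rising performance.
    dorados_2006 = constant_price_performance(1, 10)     # inferred: 10 doublings

    print(f"1991: 1 Dorado @ ${price_1991:.2f}")
    print(f"2006: {dorados_2006} Dorados @ ${price_1991:.2f}")
    # Implied doubling periods, in months, for each leg of the path.
    print(f"{11 * 12 / 7:.1f} months, then {15 * 12 / 10:.1f} months")
    # Observed 400 to 800 Dorados as a fraction of the predicted 1024.
    print(f"{400 / dorados_2006:.2f} to {800 / dorados_2006:.2f}")
```

Both legs imply a doubling period of about 18 to 19 months, and the
observed 400 to 800 Dorados comes to roughly 40% to 80% of the predicted
1024.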

Given that you lose a factor of 10 in performance when you implement a
processor in FPGAs instead of custom chips and that I have invested
everything I ever had in building FPGA based Smalltalk computers, the
issue of whether there is a significant inefficiency in the most popular
architectures used today is a particularly vital one for me. Kragen
Javier Sitaker's evaluation seems very reasonable to me and is pretty
close to the numbers I had reached myself (see previous
paragraph). I am still optimistic about my work but have no expectations
of getting 100 times the performance of a Pentium M - that is for my
next project ;-)

-- Jecel