I found the subheading for the following press release funny in the
context of Squeak:

http://www.lsi.com/news/product_news/2006_11_13.html

"ZEVIO 1020 processor provides best-in-class cost, power and performance
for the expanding digital consumer appliance market, including eToys and
portable navigation devices"

Though a 150MHz ARM9 would make for a rather slow Squeak machine (even
with all the neat graphics coprocessors it includes), at $8 (large
volumes) we can forgive this :-)

-- Jecel
Hi Jecel,

nice find, perhaps they learned the eToys name from the OLPC project ;-)

Question to the VM guru: are there any opcodes in the instruction set of
ARM9 which allow a *fast* VM? A *small* VM is addressed by the Thumb
thing, IIRC. But what about speed? And which VM technology does ARM9
support best: (direct) threaded bytecode, or something else?

Thank you for your time.

/Klaus
The most severe drawback of ARM is its lack of floating point support.
That's one of the major reasons OLPC went with the Geode and not ARM.

- Bert -
In reply to this post by Klaus D. Witzel
On 14-Nov-06, at 6:44 AM, Klaus D. Witzel wrote:
> Hi Jecel,
>
> nice find, perhaps they learned the eToys name from the OLPC
> project ;-)
>
> Question to the VM guru: are there any opcodes in the instruction
> set of ARM9 which allow a *fast* VM? A *small* VM is addressed by
> the thumb thing, IIRC. But what about speed. And what VM
> technology, (direct) threaded bytecode, or what does ARM9 support
> best.

one variation (actually ARM9 is a family designation for a range of
actual chips) among many.

Things that are good for running a squeakish VM include good fast
integer handling, a barrel shifter, and, perhaps most interesting, the
very fast call / return from subroutines, which can reduce the use of
inlining and save memory traffic. The fast interrupt handling can be
useful too.

Bad things include the almost non-existent cache and no FPU.

I still think that a maxed out ARM11 with the full 4MB cache, 4MB TCM
and the FPU would be a very fast system. But nobody is banging on my
door to offer me one. :-(

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: MAW: Make Aggravating Whine
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote:
> nice find, perhaps they learned the eToys name from the OLPC project ;-)

This is a rather obvious name - I remember two companies fighting over
it back in the .com era.

> Question to the VM guru: are there any opcodes in the instruction set of
> ARM9 which allow a *fast* VM? A *small* VM is addressed by the thumb
> thing, IIRC. But what about speed. And what VM technology, (direct)
> threaded bytecode, or what does ARM9 support best.

This chip has the Jazelle 1 technology, which is unfortunately hardwired
for Java. The most common bytecodes get translated on the fly into a
single ARM instruction (just like Thumb) while the more complex ones
trap into a software implementation. An equivalent circuit for handling
Squeak bytecodes would be fantastic.

Though it lacks the register windows I like so much, the ARM has been my
favorite RISC instruction set since I studied it 21 years ago. Only in
the last couple of months have I been able to come up with something I
feel is nicer (http://www.merlintec.com:8080/Hardware/RISC42). Two
things that make the ARM great for implementing stuff like Squeak are
the ability to use shifts and rotates with every instruction (allowing
up to four registers to be specified instead of just three) and the
ability to conditionally execute any instruction. The latter reduces the
cost of doing quick stuff like clearing tags.

Bert Freudenberg wrote:
> The most severe drawback of ARM is lacking floating point support.
> That's one of the major reasons OLPC went with the Geode and not ARM.

That is the official story, but I would guess that the fact that AMD was
funding the project was a more important reason. Consider the ARM7500FE
from the late 1990s, for example (click on "features"):

http://www.cirrus.com/en/products/pro/detail/P940.html

Its hardware floating point implementation was a good match for its
integer performance (both very weak by today's standards).

The truth is that most ARM customers don't care about floating point and
so most off-the-shelf variations don't include it. Though OLPC had a
minimum projected volume of tens of millions and is an extremely cost
sensitive project, they decided early on to limit themselves to chips
that were already in use in other high volume products. This meant they
would avoid any components just being launched and would not design any
chips of their own. With that limitation, floating point is indeed a
problem for the ARM. But in the end they had to create two custom chips
for the project (a display controller for the special LCD, and the
camera and flash controller).

Alex Perez pointed out this very cool variation of the ARM that will
soon be available in production quantities:

> http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=i.MX31&nodeId=01J4Fs2973ZrDR

The vector floating point unit seems like a great option which Squeak's
FloatArrays could easily be patched to take full advantage of. In
contrast, the Geode's floating point performance turned out to be only
about half of what the OLPC people originally had thought it would be.

-- Jecel
In reply to this post by Klaus D. Witzel
Hi Jecel,

on Tue, 14 Nov 2006 20:49:09 +0100, you wrote:
> Klaus D. Witzel wrote:
>> nice find, perhaps they learned the eToys name from the OLPC project ;-)
>
> This is a rather obvious name - I remember two companies fighting over
> it back in the .com era.
>
>> Question to the VM guru: are there any opcodes in the instruction set of
>> ARM9 which allow a *fast* VM? A *small* VM is addressed by the thumb
>> thing, IIRC. But what about speed. And what VM technology, (direct)
>> threaded bytecode, or what does ARM9 support best.
>
> This chip has the Jazelle 1 technology,

Ah, I didn't associate Jazelle with ARM9. I thought that Jazelle was
still for the lab guys.

> which is unfortunately hardwired
> for Java. The most common bytecodes get translated on the fly into a
> single ARM instruction (just like Thumb) while the more complex ones
> trap into a software implementation. An equivalent circuit for handling
> Squeak bytecodes would be fantastic.

Well, even the JVM is a universal Turing machine ;-)

I have a Squeak compiler which emits the JVM's bytecodes (into regular
class files). The run-time is hand-crafted as a very thin layer
(currently around SmallIntegers, Characters and Floats and other basic
material like ByteArray, OrderedCollection and BytesWriteStream [a
companion of aByteArray>>writeStream] and, analogous to that, String and
CharsWriteStream, plus FileDirectory and the bit of refactoring done
with Magnitude).

Just sufficient support for the compiler compiling itself :)

Now every JVM does dynamic dispatch if the correct opcodes are used
(O.K., the bytecode verifier has to be convinced, but that is only
pro forma and does not affect the run-time; recently Adrian Kuhn had the
same idea how to do this; even GCJ knows how to do this :) No
"Invokedynamic" is needed (except if that would automagically do
boxing/unboxing, then I'd employ "Invokedynamic"). And only field access
(instVar, classVar) needs the CAST opcode ;-) Well, I think that
especially Jazelle can live without the CAST opcode :)

I would like to see this all running on Jazelle but lack the expertise
for choosing the right platform for *not* producing an expensive
failure. Can you help me with your expertise to choose a Jazelle
platform? *That* would be fantastic.

/Klaus
In reply to this post by Klaus D. Witzel
Hi Jecel,

on Tue, 14 Nov 2006 20:49:09 +0100, you wrote:
...
>> Question to the VM guru: are there any opcodes in the instruction set of
>> ARM9 which allow a *fast* VM? A *small* VM is addressed by the thumb
>> thing, IIRC. But what about speed. And what VM technology, (direct)
>> threaded bytecode, or what does ARM9 support best.
> ...
> Though it lacks the register windows I like so much, the ARM has been my
> favorite RISC instruction set since I studied it 21 years ago. Only in
> the last couple of months have I been able to come up with something I
> feel is nicer (http://www.merlintec.com:8080/Hardware/RISC42). Two
> things that make the ARM great for implementing stuff like Squeak are
> the ability to use shifts and rotates with every instruction (allowing
> up to four registers to be specified instead of just three) and the
> ability to conditionally execute any instruction. The latter reduces the
> cost of doing quick stuff like clearing tags.

I like the LOGIC/KLOGIC instruction, it looks like a friend of BitBlt :)

How would Smalltalk's method lookup routine look on RISC42?

/Klaus
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote on Wed, 15 Nov 2006 05:35:21 +0100
> > This chip has the Jazelle 1 technology,
>
> Ah, I didn't associate Jazelle with ARM9. I thought that Jazelle was still
> for the lab guys.

We had a thread about this before. ARM calls two entirely different
technologies "Jazelle". I described the older one below and it is
present in many commercially available chips. The new one is not out of
the labs and is just a special variation of their Thumb instruction set
optimized as a target for JIT compilers (so it would be nice for
Exupery, for example).

> > which is unfortunately hardwired
> > for Java. The most common bytecodes get translated on the fly into a
> > single ARM instruction (just like Thumb) while the more complex ones
> > trap into a software implementation. An equivalent circuit for handling
> > Squeak bytecodes would be fantastic.
>
> Well, even the JVM is an Universal Turing machine ;-)

Sure.

> I have a Squeak compiler which emits the JVM's bytecodes (into regular
> class files). The run-time is hand-crafted as a very thin layer (currently
> around SmallIntegers, Characters and Floats and other basic material like
> ByteArray, OrderedCollection and BytesWriteStream [a companion of
> aByteArray>>writeStream] and analog to that String and CharsWriteStream,
> plus FileDirectory and the bit of refactoring done with Magnitude).
>
> Just sufficient support for the compiler compiling itself :)

Very nice. I think there have been other Smalltalks which ran on the
JVM. At least I seem to remember something called "Bistro".

> Now every JVM does dynamic dispatch if the correct opcodes are used (O.K.
> the bytecode verifier has to be convinced but that is only pro-forma and
> does not affect the run-time; recently Adrian Kuhn had the same idea how
> to do this; even GCJ knows how to do this :)

In the case of someone just wanting to use the Jazelle technology for
Squeak instead of running it on a regular JVM, there is no need to worry
about thinking like the bytecode verifier. You can bend the rules as
much as you like.

> No "Invokedynamic" is needed
> (except if that would automagically do boxing/unboxing, then I'd employ
> "Invokedynamic"). And only field access (instVar, classVar) needs the CAST
> opcode ;-) Well, I think that especially Jazelle can live without the CAST
> opcode :)

Probably.

About the Invokedynamic - given how Jazelle works (essentially a
bytecode->ARM translation ROM with all the complicated instructions
translated to "call xxx"), it is probably no more and no less costly
than the other invoke bytecodes. So some optimizations you might need on
a normal JVM might not get the same results here.

> I would like to see this all running on Jazelle but lack the expertise for
> choosing the right platform for *not* producing an expensive failure. Can
> you help me with your expertise to choose a Jazelle platform, *that* would
> be fantastic.

That depends on many details. Since this is very off topic we should
move it off squeak-dev.

-- Jecel
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote on Wed, 15 Nov 2006 05:40:06 +0100
> I like the LOGIC/KLOGIC instruction, it looks like a friend of BitBlt :)

That is the idea, though perhaps a KSHIFT/LOGIC combination (which the
ARM can also handle) would be even more important for the graphics
primitives.

> How would Smalltalk's method lookup routine look on RISC42?

This design was created for running C, not Smalltalk. So the answer is
that it would be just the same as on any other generic RISC. That said,
it would be nice to run SqueakNOS on this. A teacher at a local
university was trying to put together a project for creating an open
source RISC core with his students and that inspired me to come up with
this, but the idea is that they would do the actual work of implementing
it.

The designs that I *am* working on are optimized for Smalltalk. The
current 16 bit version (http://www.merlintec.com:8080/Hardware/Oliver)
essentially has a vast table of all class/selector combinations. Each
table entry is two words long and holds the actual instructions for the
corresponding method. So you just have to do "call
table[class(receiver),selector]" which, not counting cache misses, can
be executed in a single clock cycle (the "class(receiver)" function
doesn't involve a memory access since I use a variation of the class
encoding scheme from Smalltalk-74's OOZE). The bulk of the table entries
(over 80%) are for invalid class/selector combinations and their code is
just a jump to the MNU routine. Many other entries are for very short
methods that fit in the two words. For longer methods the table entry
has a jump to the rest of the method (since it is "long" anyway, the
overhead of the jump won't be too bad).

This scheme wastes memory, but I happen to have 8MB on a machine that
would work just fine with 2MB or less. So I have 6MB that would be
useless otherwise and wasting them to save a few clocks per message send
is a great option.

For a 32 bit Smalltalk this wouldn't work. The Squeak 3.8 image I am
typing this in has 2339 Metaclasses and 40849 ByteSymbols, and so would
need a table with 95 million entries (764MB if each entry has two words
of 32 bits each). So for that case I have an alternative called "PIC
Mode". I should put a description of this on my swiki, but basically it
scales far better and allows type feedback for optimizing compilers, but
sends take a few more clocks than in the 16 bit version (hopefully
inlining eliminates many sends and makes up for that).

-- Jecel
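[Editor's sketch] The full class/selector table described in the message above can be modelled in a few lines of Python. This is only an illustration under stated assumptions, not Jecel's hardware design: the names `DispatchTable`, `table_bytes`, `message_not_understood` and `ENTRY_WORDS` are hypothetical, classes and selectors are assumed to be small dense integers, and a Python callable stands in for the two words of instructions held in each entry.

```python
# Sketch of a full (non-hashed) class/selector dispatch table.
# All names below are hypothetical; they do not come from the thread.

ENTRY_WORDS = 2  # the post says each table entry is two words long


def table_bytes(num_classes, num_selectors, word_bytes):
    """Bytes needed for one two-word entry per class/selector pair."""
    return num_classes * num_selectors * ENTRY_WORDS * word_bytes


def message_not_understood(receiver, *args):
    """Default entry for invalid class/selector combinations (MNU)."""
    raise AttributeError("message not understood")


class DispatchTable:
    def __init__(self, num_classes, num_selectors):
        self.num_selectors = num_selectors
        # Every entry starts out as a jump to the MNU routine.
        self.entries = [message_not_understood] * (num_classes * num_selectors)

    def _index(self, class_index, selector_index):
        # Row = class, column = selector; no search and no hashing.
        return class_index * self.num_selectors + selector_index

    def install(self, class_index, selector_index, method):
        self.entries[self._index(class_index, selector_index)] = method

    def send(self, class_index, selector_index, receiver, *args):
        # Models "call table[class(receiver),selector]".
        return self.entries[self._index(class_index, selector_index)](receiver, *args)
```

With the figures quoted for the Squeak 3.8 image, `table_bytes(2339, 40849, 4)` gives 764,366,488 bytes for 95,545,811 entries, which matches the "95 million entries (764MB)" estimate. Installing a method with `install(1, 2, method)` and calling `send(1, 2, receiver)` is a single indexed load followed by a call; any pair that was never installed falls through to the MNU handler.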
In reply to this post by Klaus D. Witzel
Hi Jecel --

>The designs that I *am* working on are optimized for Smalltalk. The
>current 16 bit version (http://www.merlintec.com:8080/Hardware/Oliver)
>essentially has a vast table of all class/selector combinations.

We called this the "giant hash table scheme" at PARC and thought about
using some (future) VM hardware to look up this more useful virtual
address for methods.

I'd love to hear how this works out (and what kinds of HW would do it
better, like a specially programmed FPGA, etc.).

Cheers,

Alan
Alan Kay wrote on Fri, 17 Nov 2006 08:15:51 -0800
> >[vast table of all class/selector combinations]
>
> We called this the "giant hash table scheme" at PARC and thought
> about using some (future) VM hardware to look up this more useful
> virtual address for methods.

In this case I have the full table, not a hash one. The address is
virtual in the sense that I have an object table, but otherwise it is
pretty much a physical one.

When David Ungar saw my talk about this at OOPSLA 2003, where I
mentioned I had only gone this route because the smallest memory I could
buy cheaply was an 8MB SDRAM chip for $2, he wondered if some papers
being presented at that conference wouldn't become irrelevant in the
future due to brute force advances in hardware. But I think this
particular machine is very atypical and all the great message dispatch
theory of the past still applies. In fact, I would recommend that people
interested in this read this comparison paper:

http://www.cs.ucsb.edu/~urs/oocsb/papers/dispatch.html

But for this machine I had 512KB of permanent storage (90KB of that is
the bits for programming the FPGA) and 8MB of RAM. So though the system
is now Smalltalk it has to be as tiny as the Forth it was originally to
fit in the Flash, yet it can waste megabytes of RAM in helper tables to
make things a little faster. This isn't the future, just a small niche
(a certain kind of embedded application).

Due to the 16 bit object pointers, this system has a very poor
reflective structure. Methods aren't objects, for example, but rather
all code for a class is lumped into a single vector. This starts out
with the corresponding row of the class/selector table and also contains
the rest of the methods that don't fit in two words. The selector times
2 directly gives you the initial value for the program counter (which is
an index into that vector) for the method.

Given how small the system is and how fast current machines are (even
this processor at only 54MHz) the implementation isn't very incremental.
When you add a new selector to the system all classes are regenerated to
expand their table part to make room for this. When the source for a
method is edited the whole class and subclasses (the system doesn't
actually have these, but these Smalltalk-80 terms give the rough idea)
are recompiled.

> I'd love to hear how this works out (and what kinds of HW would do it
> better (like a specially programmed FPGA, etc.).

The 32 bit version will be more interesting since brute force solutions
don't work. I have just created a new page to explain the "PIC Mode"
(http://www.merlintec.com:8080/hardware/33) but I probably won't be able
to put any real content there until next month. The idea is very simple,
however:

The processor has a normal execution mode and a PIC mode. During normal
execution instructions are fetched from the location in the code cache
pointed to by the program counter. It doesn't fetch instructions
directly from methods - you have to copy the bytecodes (or translate, if
the processor doesn't directly execute bytecodes) to the code cache and
jump there. Inside the processor there is normally an instruction cache.

Some instructions (like "send") can switch the processor to PIC mode. In
that case instructions are fetched from a PIC cache instead of the code
cache. The instruction doing the switch must supply a "type" parameter
and the PIC cache is accessed with a 64 bit <PC,Type> address. The PC
part is the address of the instruction which switched to PIC mode and
the instructions are streamed from the PIC cache without changing the
PC. If the whole cache line is used then execution goes back to the
normal mode at PC+1. PIC mode is also exited if any branch or call
instructions are executed.

So if you execute a send instruction then the next instruction is
determined by the receiver's class (type). Suppose that this send
bytecode is at PC=16r213412 in the code cache and that the PIC cache has
five entries tagged as <16r213412,X> with five different values of X. If
the receiver type doesn't match any of them we call the copier (or
translator) to create a new entry. If the receiver type is one of those
five, then several instructions (up to the size of a cache line) are
executed as if they were inlined between 16r213412 and 16r213413. A very
common case is for these instructions to simply call a given method, but
for simple and short methods they could be the body of the method
itself.

Some software has to manage all the related entries in the PIC cache and
this is the source of type information when a method is being recompiled
to generate more optimized code.

David Faught wrote:
> Wouldn't Content Addressable Memory (CAM) work for this? I can
> remember reading about this years ago, and I know that some current
> network routers and switches use this to speed up their table lookups.

CAMs are the way to go at least for the caches inside the chip itself.
For larger second level caches that live in main memory you want to use
some kind of hashing solution. Unfortunately FPGAs are very bad at
implementing CAMs, so I am forced to use hashing for the first level
caches as well. On a custom chip version of the design I would do as you
suggested.

-- Jecel
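[Editor's sketch] The <PC,Type> lookup described in the message above can be approximated in software as a cache keyed by the send site and the receiver's type. This is a rough model under loud assumptions, not the processor design: `PicCache`, `send`, `types_seen` and the `translator` callback are hypothetical names, a Python dictionary stands in for the hardware cache, and a single callable stands in for the cache line of instructions streamed in PIC mode (so the "return to PC+1" and "exit on branch or call" rules are not modelled).

```python
# Software model of a PIC cache addressed by <PC, Type>.
# All names are hypothetical; a dict replaces the hardware cache.

class PicCache:
    def __init__(self, translator):
        # (pc, receiver_type) -> callable standing in for one cache line
        self.lines = {}
        # Called on a miss to create a new entry, like the copier/translator
        self.translator = translator

    def send(self, pc, receiver, *args):
        key = (pc, type(receiver))
        target = self.lines.get(key)
        if target is None:
            # Receiver type not seen at this send site: build a new entry.
            target = self.translator(pc, type(receiver))
            self.lines[key] = target
        # Hit (or freshly filled line): run the cached code for this type.
        return target(receiver, *args)

    def types_seen(self, pc):
        """Receiver types recorded at one send site (type feedback)."""
        return {rtype for (site, rtype) in self.lines if site == pc}
```

As a usage example, a translator that returns `getattr(receiver_type, "__len__")` lets `send(0x213412, [1, 2, 3])` and `send(0x213412, "ab")` share one send site while filling two separate entries. A later `types_seen(0x213412)` then reports both `list` and `str`, which is the kind of information an optimizing compiler could use when recompiling the method.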
In reply to this post by Klaus D. Witzel
Hi Jecel,

on Fri, 17 Nov 2006 00:22:08 +0100, you wrote:
> Klaus D. Witzel wrote on Wed, 15 Nov 2006 05:35:21 +0100
>> > This chip has the Jazelle 1 technology,
>>
>> Ah, I didn't associate Jazelle with ARM9. I thought that Jazelle was
>> still for the lab guys.
>
> We had a thread about this before. ARM calls two entirely different
> technologies "Jazelle". I described the older one below and it is
> present in many commercially available chips. The new one is not out of
> the labs and is just a special variation of their Thumb instruction set
> optimized as a target for JIT compilers (so it would be nice for
> Exupery, for example).

Ah, I see.

>> > which is unfortunately hardwired
>> > for Java. The most common bytecodes get translated on the fly into a
>> > single ARM instruction (just like Thumb) while the more complex ones
>> > trap into a software implementation. An equivalent circuit for
>> > handling Squeak bytecodes would be fantastic.
>>
>> Well, even the JVM is an Universal Turing machine ;-)
>
> Sure.
>
>> I have a Squeak compiler which emits the JVM's bytecodes (into regular
>> class files).
>> Just sufficient support for the compiler compiling itself :)
>
> Very nice. I think there have been other Smalltalks which ran on the
> JVM. At least I seem to remember something called "Bistro".

Robert Tolksdorf has collected an impressive list (200 different
systems) of systems which use the JVM as is:

- http://www.robert-tolksdorf.de/vmlanguages.html

>> Now every JVM does dynamic dispatch if the correct opcodes are used
>> (O.K. the bytecode verifier has to be convinced but that is only
>> pro-forma and does not affect the run-time; recently Adrian Kuhn had
>> the same idea how to do this; even GCJ knows how to do this :)
>
> In the case of someone just wanting to use the Jazelle technology for
> Squeak instead of running it on a regular JVM there is no need to worry
> about thinking like the bytecode verifier. You can bend the rules as
> much as you like.

Sure :) The point I meant was just to convince the verifier while still
debugging and testing on regular JVM platforms. That saves a lot of time
and headaches. The first utility I wrote in this direction was for
replacing CAST bytecodes by NOOP. Many CASTs are just required by stupid
static-minded regular javac / GCC, not by the bytecode verifier nor the
JVM ;-)

>> No "Invokedynamic" is needed
>> (except if that would automagically do boxing/unboxing, then I'd employ
>> "Invokedynamic"). And only field access (instVar, classVar) needs the
>> CAST opcode ;-) Well, I think that especially Jazelle can live without
>> the CAST opcode :)
>
> Probably.
>
> About the Invokedynamic - give how Jazelle works (essentially a
> bytecode->ARM translation ROM with all the complicated instructions
> translated to "call xxx") it is probably no more and nor less costly
> than the other invoke bytecodes. So some optimizations you might need on
> a normal JVM might not get the same results here.

Of course.

>> I would like to see this all running on Jazelle but lack the expertise
>> for choosing the right platform for *not* producing an expensive
>> failure. Can you help me with your expertise to choose a Jazelle
>> platform, *that* would be fantastic.
>
> That depends on many details. Since this is very off topic we should
> move it off squeak-dev.

O.K., can you email items / topics / concerns for starting an off-list
discussion, please? Thank you. Even if it turns out to be infeasible,
I'd like to find out why. OTOH, if it's feasible and not expensive, I'd
like to create such a system on as common a HW+SW platform as possible :)

/Klaus

> -- Jecel