Some of you know that I have been working on the design of the SiliconSqueak processor, which is optimized for the OpenSmalltalk VM. I am currently redesigning it to be a proper extension to the RISC-V standard instruction set (https://riscv.org/). While the result will not be as technically elegant as previous versions, it should be much more interesting commercially. RISC-V was started at Berkeley in 2010 and has a growing community which I expect will very soon make it the third most important instruction set after x86 and ARM.

It was created to be expandable: you start with one of the four integer instruction subsets (RV32I, RV64I, RV128I, or RV32E with only 16 registers for embedded systems) and then you can optionally add some of the standard extensions:

http://linuxgizmos.com/files/riscv_ref_card1.jpg
http://linuxgizmos.com/files/riscv_ref_card2.jpg

- Integer Multiplication and Division (M)
- Atomics (A)
- Single-Precision Floating-Point (F)
- Double-Precision Floating-Point (D)

The IMAFD combination is considered popular enough that you can use G (for General) instead, so RV32G is the same thing as RV32IMAFD. Some more standard extensions are in the process of being defined:

- Quad-Precision Floating-Point (Q)
- Decimal Floating-Point (L)
- 16-bit Compressed Instructions (C)
- Bit Manipulation (B)
- Dynamic Languages (J)
- Transactional Memory (T)
- Packed-SIMD Extensions (P)
- Vector Extensions (V)
- User-Level Interrupts (N)

Non-standard extensions are also allowed. For example, Xhwacha means the processor has the Hwacha vector extension, which is different from the V vector extension that is being defined. So a processor named RV32GXsisq would be a general 32-bit RISC-V with SIliconSQueak extensions (which I will describe below). I am in the process of becoming a RISC-V Foundation member so I can join the J extension working group, since I feel I can help even though the "J" is meant to imply Java and JavaScript. It would also be a good idea to avoid any needless incompatibility between Xsisq and J.

My goal is both to improve the performance and efficiency of the bytecode interpreter and to make the processor a better target for adaptive compilation. The ARM folks also did this in Jazelle 1 (later renamed DBX - direct bytecode execution) and Jazelle 2 (renamed RCT - runtime compilation target, and later ThumbEE). I want to modify Cog both to work with RV32G and with my extensions, and to have the simulator interface with a cycle-accurate processor simulator that can let us measure the effects of the extensions for both simple implementations and fancier out-of-order execution engines.

It would be nice to have data to guide the design of these extensions before all this work, but the fact that critical operations are done via macros in the OpenSmalltalk VM (as they should be!) instead of subroutines makes it hard to know how much time is spent on PICs or allocating new objects, for example.

== Xsisqbytecodes

This is a special execution mode which uses two extra registers: IP, pointing to the next bytecode, and BCTableBase, which points to an aligned 1024 byte table in memory. The instruction at BCTableBase + ((*(unsigned char *)IP++) << 4) is executed. If that instruction is some kind of branch then bytecode mode is exited; otherwise the next bytecode is fetched. Combined with the next extension, the most common bytecodes can be interpreted by a single RISC-V instruction.
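As a rough illustration, here is a minimal C sketch of how a simulator might model this dispatch loop. All the names (cpu_t, execute_at and so on) are made up for the example, not taken from any real simulator:

#include <stdint.h>

typedef struct {
    uint8_t  *ip;            /* IP: points at the next bytecode */
    uintptr_t bctable_base;  /* BCTableBase: aligned handler table */
} cpu_t;

/* Assumed helper: execute the RISC-V instruction(s) starting at addr
   and return nonzero if a branch was taken. */
extern int execute_at(cpu_t *cpu, uintptr_t addr);

static void run_bytecode_mode(cpu_t *cpu)
{
    for (;;) {
        uint8_t bc = *cpu->ip++;                 /* fetch the next bytecode */
        uintptr_t slot = cpu->bctable_base + ((uintptr_t)bc << 4);
        if (execute_at(cpu, slot))
            return;          /* any branch exits bytecode mode */
        /* otherwise fall through and dispatch the next bytecode */
    }
}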
== Xsisqstack

Registers x16 to x31 are remapped to a moving window (not quite like RISC I, RISC II, RISC III (SOAR - Smalltalk On A RISC) or RISC IV (Spur), but in the same spirit). In addition, register x15 becomes an alias for one of the others, so you can push and pop to the stack implicitly instead of using extra instructions. When combined with the next extension, each register and word on the stack gets a 33rd (or 65th or 129th) bit that can't be changed by the software and which distinguishes between raw values and tagged values.

== Xsisqobjectmemory

Normal RISC-V load and store instructions generate a virtual address by adding a 12 bit immediate value to a register value. This virtual address is translated to a physical address by the MMU, and that is used to access the cache. In the object addressing mode, the value in the register is considered a virtual object ID and the immediate value is an offset. They are not added but used separately by the cache hash to access the desired word. On a cache miss, an object table is used to find the physical address. The object formats are known, so loads and stores can tell raw values from tagged words. Both V3 and Spur images use direct pointers and so can't take advantage of this mode, though the RoarVM could probably be adapted since it uses object tables.

== Xsisqpic

A special PIC execution mode is entered with an instruction that reads an object's class. The instruction cache loads a line hashed by both the class and the current PC, and the instructions there are executed until some branch happens. So it is like a combination of call and switch. This means that PICs take the same time no matter how many entries they have. If all levels of cache miss, the compiler has to be called to create a new PIC entry for this PC/class combination.

Besides these four extensions, there is an Xteam extension which allows a group of cores to work together on a single piece of code. Using this will require adding a new compiler to Cog, so I won't go into it here.

There are interesting possible extensions which I have not worked on:

- support for GC, like read or write barriers (though Xsisqobjectmemory does help)
- support for execution counters
- support for JIT, like the hardware accelerators in this RISC-V to VLIW runtime compiler:

> https://riscv.org/wp-content/uploads/2017/05/Wed1545-HybridDBT-Rokicki.pdf

Any other ideas?

-- Jecel
Hi Jecel,
On Wed, Mar 28, 2018 at 2:25 PM, Jecel Assumpcao Jr. <[hidden email]> wrote:
One thing that should be straightforward to do is to modify the simulator to collect instructions and attribute them to specific tasks. This wouldn't give you cycle counts for each kind of operation, but one could augment the instruction counts with max and min cycle counts for a range quite easily (but getting that cycle data is tedious without a convenient online form). I don't think that Bochs has any support for accurate cycle counts, or page faults etc. But I think instruction counts would be very interesting. Does this sound worthwhile to you?
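A minimal sketch of that kind of accounting, with hypothetical names throughout (classify() stands in for whatever maps a simulated instruction's address to a task):

#include <stdint.h>

enum task { TASK_SEND, TASK_ALLOC, TASK_STORE_CHECK, TASK_OTHER, NTASKS };

/* Assumed hook: map the address of a simulated instruction to a task. */
extern enum task classify(uintptr_t pc);

static uint64_t insnCount[NTASKS];   /* instructions attributed per task */
static uint64_t cycleMin[NTASKS];    /* lower bound on cycles per task   */
static uint64_t cycleMax[NTASKS];    /* upper bound on cycles per task   */

/* Called once per simulated instruction, with that instruction's
   best- and worst-case cycle counts taken from the processor manual. */
void account(uintptr_t pc, unsigned minCycles, unsigned maxCycles)
{
    enum task t = classify(pc);
    insnCount[t] += 1;
    cycleMin[t]  += minCycles;
    cycleMax[t]  += maxCycles;
}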
Ah, that's nice. So BCTableBase can be used to implement multiple bytecode sets, right? What support is there for setting BCTableBase, e.g. on return?

== Xsisqstack

Some examples would make this clearer. I don't understand the extra tag bit. I find a single tag bit restrictive. Is the scheme more general than I'm presuming?

== Xsisqobjectmemory

I wonder if this will fit with Clément's new PhD student's work on extending Cog for vector instructions and with Ronie's Lowcode work. We also want to extend the compiler, in this case to include vector support.
My ideas are probably too naive, and too specific to the current Cog VM. But they're simple things like:

- having a conditional move which doesn't raise an exception if it doesn't move data would make writing the class test fast. Unfortunately the x86/x86_64 conditional move raises an exception if given an invalid address when the condition is false, so one can't use it to do the "if the value is untagged, fetch the class" operation without a jump, because when the value is tagged, even though no data is moved, an exception is raised.

- having a status register, or status bits, or bits in the tagged arithmetic instruction itself, which define what is the valid tag pattern for tagged arithmetic instructions would be nice. VW on SPARC never used the tagged arithmetic instructions because they mandated 00 as the tag pattern for tagged integers. Sad. Having a non-trapping tagged arithmetic instruction (add tagged and skip if untagged) would be nice (because traps are expensive).

- putting the immediate floating point encode/decode sequences in hardware would be nice.

But I'm intrigued by your statistics gathering suggestion. It would be great to have finer grained stats. Although I wonder how valid those stats would be in a processor with extensive support for OO.

_,,,^..^,,,_
best, Eliot
Eliot,

thank you for your comments.

> > [macros and gathering data]
>
> One thing that should be straightforward to do is to modify the
> simulator to collect instructions and attribute them to specific tasks.

You mean individual instructions or short sequences of instructions? Perhaps we are talking about different things. What I meant was that watching Clément's video about fixing a bug in the VM I noticed that the code for #new had been inlined into the method he was looking at. That same short sequence probably shows up in a lot of places. I want to know what percent of the time is spent in all such sequences added together. If it turns out to be 0.03% of the time, then special hardware to make #new faster would be a bad idea.

Most of the data in the green book was easy to get because something like #new would be a subroutine, and normal profiling tools can tell you how much time is spent in a given subroutine.

> This wouldn't give you cycle counts for each kind of operation but
> one could augment the instruction counts with max and min cycle
> counts for a range quite easily (but getting that cycle data is tedious
> without a convenient online form). I don't think that Bochs has any
> support for accurate cycle counts, or page faults etc. But I think
> instruction counts would be very interesting. Does this sound worthwhile
> to you?

The simulators I have written so far also suppose infinite caches and so on. I am going to fix this in my next one. The x86 is pretty complicated (which is why you adopted Bochs in the first place, right?), but an ARM or RISC-V simulator in Squeak that we can more easily instrument should be doable. But that is at the user level - page faults and such require simulating the supervisor level and an OS.

About instruction counts, they are certainly very important even if less helpful than cycle counts.

> > [BCTableBase + ((*(unsigned char *)IP++) << 4)]
>
> Ah, that's nice. So BCTableBase can be used to implement multiple
> bytecode sets, right? What support is there for setting BCTableBase,
> e.g. on return?

I have not yet looked at how multiple bytecode sets are handled in Cog. What I proposed is enough to let different threads have different bytecode sets, at least. Changing bytecode sets on send and return would mean explicitly changing BCTableBase in the prolog/epilog sequences of each method, and that might cause some thrashing in the instruction cache.

> > [extra bit in registers and stack]
>
> Some examples would make this clearer. I don't understand the extra
> tag bit. I find a single tag bit restrictive. Is the scheme more general
> than I'm presuming?

This makes it a three level scheme:

<32 bits> 0 = raw values (like in a ByteArray or Bitmap)
<32 bits> 1 = see tag in low bits:
    <31 bits> 1 1 = SmallInteger
    <30 bits> 10 1 = Character
    <30 bits> 00 1 = oop - see object header for more details

The 33rd bit shown above doesn't exist in memory or cache, but is created by load instructions (which must know which addresses have tagged values and which don't, which is pretty complicated in the normal Squeak image formats) and is checked by the store instructions. Each integer register has an associated 33rd bit, and that is saved and restored as it is spilled to the stack.

When using this mode you can convert a SmallInteger to a 32 bit raw word and the other way around (if it was actually a 31 bit value), but you can't manipulate an oop in any way at the user level (you can trap to the supervisor level to have that do it for you).
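To make that concrete, here is a rough C model of the 32-bit case as a simulator might see it. The struct field stands in for the hidden hardware bit, and all names are illustrative:

#include <stdint.h>

/* One integer register as the hardware sees it: 32 bits of value plus
   the hidden 33rd bit saying whether the value is tagged. */
typedef struct {
    uint32_t value;
    unsigned tagged;   /* the 33rd bit; not directly writable by software */
} xreg;

/* A load sets the hidden bit; the hardware is assumed to know whether
   the source word holds a tagged value. */
static xreg tagged_load(uint32_t word, int word_is_tagged)
{
    xreg r = { word, word_is_tagged != 0 };
    return r;
}

/* SmallInteger -> raw word, the one conversion allowed at user level.
   Low bit 1 marks a SmallInteger in the scheme above; an arithmetic
   right shift is assumed. */
static int small_int_to_raw(xreg r, xreg *out)
{
    if (!r.tagged || (r.value & 1) == 0)
        return 0;                                    /* not a SmallInteger */
    out->value  = (uint32_t)((int32_t)r.value >> 1); /* drop the tag bit */
    out->tagged = 0;                                 /* result is raw */
    return 1;
}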
Note that there is a tagged RISC-V implementation, and a second one that is migrating from MIPS64 to RISC-V, and both store the extra bits for each word in memory:

http://www.lowrisc.org/docs/tagged-memory-v0.1/tags/
https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

Both of these solutions are trying to improve the security of C code. Things are simpler for us. Note that this option wouldn't be used for V3 or Spur - a secure VM needs this plus a few other things (like checking that an image hasn't been changed by some external program) to make any sense (so it would have to be signed or something like that).

> > Besides these four extensions, there is an Xteam extension which allows a
> > group of cores to work together on a single piece of code. Using this
> > will require adding a new compiler to Cog, so I won't go into it here.
>
> I wonder if this will fit with Clément's new PhD student's work on extending
> Cog for vector instructions and with Ronie's Lowcode work. We also want
> to extend the compiler, in this case to include vector support.

It would probably work great! But the standard V extension is pretty good too and might be a better option. Let me go into the history of this:

One thing I wanted to do when SiliconSqueak started in 2009 was to take advantage of the fact it was going to be implemented in an FPGA to add an extra level of compilation, so that the most critical loops would get converted to hardware. I had helped a friend with his PhD in the 1990s that was a step in this direction (his project was called SelfHDL). Unfortunately the secretive nature of FPGA tools made this very hard to do. In 2010 I came up with the idea of an ALU Matrix coprocessor with, for example, 8 by 8 ALUs of 8 bits each that had a nice way to exchange data with their neighbors. So the extra compiler could target that instead of trying to generate hardware, and I could still take advantage of the FPGA by changing from few cores with coprocessors to more cores without, depending on the needs of the compiled code.

> https://www.researchgate.net/publication/312029084_2014_poster_reconfiguration

A big problem was that the coprocessor had a lot of internal state and also a big program memory, which made sharing it between different tasks very costly (this was the worst problem of the Intel i860). In 2016 (after my PhD deadline had passed anyway) I worked on this by making the ALU matrix an independent processor with an instruction cache instead of manually loaded instruction memory (even though the Cell processors, which were similar, turned out to be very hard to program). In 2017 I worked on vector registers very much like the RISC-V V Extension. Some of the code that the ALU Matrix could handle was not very regular, and the vectors don't help in those cases.

So at the end of 2017 I came up with a way to make several simple cores work together when needed but go their own way otherwise. This would work with fixed hardware instead of changing things in an FPGA. Not as interesting for a PhD, but better for a commercial product. The idea is that you can map some registers to channels like in the old Transputer. So you might say that register x12 in core 7 goes to register x10 in core 11. Trying to read from a channel that hasn't been written to yet will block, as will trying to write a second time to a channel whose previous value hasn't been read yet (see the sketch below).

Running the same code on all cores will be a good approximation of a vector system, but more costly, as each core has its own control logic, PC and so on.
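Here is a rough C model of those channel semantics (a single-slot rendezvous, in the spirit of Transputer links). It is illustrative only: it models the blocking rules, not the memory-ordering machinery a real multicore implementation would need, and all the names are made up:

#include <stdint.h>

/* One channel: a single slot that must be read before it can be
   written again. */
typedef struct {
    uint32_t value;
    volatile int full;   /* 1 = written but not yet read */
} channel;

static void channel_write(channel *ch, uint32_t v)
{
    while (ch->full)
        ;                /* a second write blocks until the reader drains it */
    ch->value = v;
    ch->full  = 1;
}

static uint32_t channel_read(channel *ch)
{
    uint32_t v;
    while (!ch->full)
        ;                /* a read blocks until something is written */
    v = ch->value;
    ch->full = 0;
    return v;
}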
But you might compile a basic block to run on a team of 6 cores, for example, and have each core do a different thing. This is the software equivalent of an out-of-order single core (since different cores can move forward at different rates except when blocked by the channels). Here I am thinking not only of an advanced Cog but also what would be good for Qemu or MAME. This is similar to the old RAW project from MIT.

> My ideas are probably too naive, and too specific to the current Cog VM. But
> they're simple things like:
>
> - having a conditional move which doesn't raise an exception if it doesn't
> move data would make writing the class test fast. Unfortunately the
> x86/x86_64 conditional move raises an exception if given an invalid address
> when the condition is false, so one can't use it to do the "if the value is
> untagged, fetch the class" operation without a jump, because when the value
> is tagged, even though no data is moved, an exception is raised.

I had no idea. The description of the instruction says it loads the source into a temporary register and then, if the condition is true, stores the temporary register in the destination. So the trap happens before the condition is even looked at.

Note that your use of "untagged" and mine above are very different. But I got what you meant (oop).

> - having a status register, or status bits, or bits in the tagged arithmetic
> instruction itself, which define what is the valid tag pattern for tagged
> arithmetic instructions would be nice. VW on SPARC never used the tagged
> arithmetic instructions because they mandated 00 as the tag pattern for
> tagged integers. Sad. Having a non-trapping tagged arithmetic instruction
> (add tagged and skip if untagged) would be nice (because traps are expensive).

In the first few versions of SiliconSqueak I had configurable tag hardware. This is the description of the tagConfig register:

# Tag configuration defines the operation of the two detagging units (associated with operands
# A and B) and the retagging unit (associated with the destination). The lowest 16 bits of the
# register indicate valid SmallInteger combinations of d31, d30, d1 and d0. The next higher 4
# bits are ANDed to d31, d30, d1 and d0 when converting from tagged to untagged SmallInteger
# while 4 other bits are ORed to d31, d30, d1 and d0 when converting from untagged to tagged
# SmallIntegers. 2 bits indicate how much to shift right when converting from tagged to untagged
# SmallIntegers and the same bits indicate how much to shift left for the reverse operation. The
# top 6 bits are undefined.
#
# For Squeak the bottom bit is set to 1 for SmallIntegers, so this register must be set to hex
# value 011EAAAA. The AAAA defines all odd values as valid SmallIntegers. The E will clear
# d0 when converting to raw bits and the bottom 1 will set it when retagging. The top 1 will divide
# the tagged value by 2 and multiply it back when retagging. For Self the bottom two bits are 0
# for SmallIntegers, so this register must be set to hex value 020F1111. An option that works well
# in hardware but is complicated to deal with in software is when the top two bits must match in
# SmallIntegers. This can be handled by setting this register to hex value 000FF00F.

For the 2016 version I decided this was a costly complication and just made it always use the last configuration (where the top two bits 00 or 11 indicate a SmallInteger). In software this is very complicated to do, but in hardware it is just a single XOR gate.
This meant converting when loading or saving V3 or Spur images, of course.

> Putting the immediate floating point encode/decode sequences in hardware would be nice.

With the D and F extensions you already have to let 32 and 64 bit floats share the same registers. So it wouldn't add much to let tagged floats be an option. One of the proposals is the L extension, which would allow decimal floats. That is a lot more complicated.

> But I'm intrigued by your statistics gathering suggestion. It would be great to have
> finer grained stats. Although I wonder how valid those stats would be in a processor
> with extensive support for OO.

Hardly my idea - one of the RISC-V creators wrote a book about using a Quantitative Approach to designing processors, after all. In fact, the various papers and theses published for RISC-III (SOAR) are a great example of what we need to do. And they showed how following your intuition can have bad results. For example: all their instructions were tagged, but in the end only the tagged ADD made the slightest difference. They also automatically filled registers with Nil, but it didn't help.

They didn't have the statistics gathering I want, however. What they did was to turn features on and off in their compiler and look at total execution time for a bunch of benchmarks. In my case it would be like running the benchmarks taking advantage of x15 to push/pop stuff, and then running them again with explicit stack pointer manipulation sequences. That just gives you an overall number.

-- Jecel
Hi Jecel,

> On Mar 28, 2018, at 5:09 PM, Jecel Assumpcao Jr. <[hidden email]> wrote:
>
> Eliot,
>
> thank you for your comments.
>
>>> [macros and gathering data]
>>
>> One thing that should be straightforward to do is to modify the
>> simulator to collect instructions and attribute them to specific tasks.
>
> You mean individual instructions or short sequences of instructions?
> Perhaps we are talking about different things. What I meant was that
> watching Clément's video about fixing a bug in the VM I noticed that the
> code for #new had been inlined into the method he was looking at. That
> same short sequence probably shows up in a lot of places. I want to know
> what percent of the time is spent in all such sequences added together.

Exactly.

> If it turns out to be 0.03% of the time, then special hardware to make
> #new faster would be a bad idea.

Right, exactly. We want to know what is cheap on current hardware and what is expensive, and what could be made cheaper.

> Most of the data in the green book was easy to get because something
> like #new would be a subroutine, and normal profiling tools can tell you
> how much time is spent in a given subroutine.

Right. The Cog JIT generates abstract instructions in an assembly-like style. We can add state to the abstract instructions easily. So if we modified some generation routines to set a "context" flag, such as "doing allocation", "doing dispatch", "doing store check", "doing marshalling", etc, we could label each instruction in the sequences we're interested in with that flag. Then, for example, when we generate we could reduce that flag to a set of bit vectors, one bit per byte of instruction space, one bit vector per "interesting code type", and one bit vector that ors them together. Then on simulating each instruction we can test its address in the bit vectors and find out what kind it is. It will slow down simulation, but we're happy to pay the premium fees when we're gathering statistics.

>> This wouldn't give you cycle counts for each kind of operation but
>> one could augment the instruction counts with max and min cycle
>> counts for a range quite easily (but getting that cycle data is tedious
>> without a convenient online form). I don't think that Bochs has any
>> support for accurate cycle counts, or page faults etc. But I think
>> instruction counts would be very interesting. Does this sound worthwhile
>> to you?
>
> The simulators I have written so far also suppose infinite caches and so
> on. I am going to fix this in my next one. The x86 is pretty complicated
> (which is why you adopted Bochs in the first place, right?

Right. I had considered trying to generate a simulator given a semantics (e.g. as in the Intel manual) but Josh Gargus made me see sense and pointed me at Bochs.

> ) but an ARM
> or RISC-V simulator in Squeak that we can more easily instrument should
> be doable. But that is at the user level - page faults and such require
> simulating the supervisor level and an OS.
>
> About instruction counts, they are certainly very important even if less
> helpful than cycle counts.

Useful enough data for not too much effort. Even very approximate cycle counts would be a lot more work and a much harder data set to prove correct. In any case they depend on the processor implementation, and given that those implementations evolve quickly I have never thought it worthwhile doing micro measurements to find out what the fast instruction sequences are on specific versions. People do do this and get great results. I've simply never worked in a situation where I felt I could afford the effort.
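A rough sketch of that bit-vector bookkeeping, with made-up sizes and names (pc here is an offset into the generated-code zone):

#include <stdint.h>

enum { kAlloc, kDispatch, kStoreCheck, kNumKinds };

/* Illustrative sizes only: one bit per byte of the code zone,
   one bit vector per interesting code type. */
#define CODE_ZONE_BYTES (1 << 20)
static uint8_t  kindBits[kNumKinds][CODE_ZONE_BYTES / 8];
static uint64_t kindCounts[kNumKinds];

/* At generation time: label a range of generated code with a kind. */
static void labelRange(int kind, uint32_t start, uint32_t nBytes)
{
    for (uint32_t a = start; a < start + nBytes; a++)
        kindBits[kind][a >> 3] |= (uint8_t)(1u << (a & 7));
}

/* At simulation time: bump the counter of each kind covering this pc. */
static void countInstructionAt(uint32_t pc)
{
    for (int k = 0; k < kNumKinds; k++)
        if (kindBits[k][pc >> 3] & (1u << (pc & 7)))
            kindCounts[k]++;
}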
>>> [BCTableBase + ((*(unsigned char *)IP++) << 4)]
>>
>> Ah, that's nice. So BCTableBase can be used to implement multiple
>> bytecode sets, right? What support is there for setting BCTableBase,
>> e.g. on return?
>
> I have not yet looked at how multiple bytecode sets are handled in Cog.

Claus's scheme is to maintain a single variable (at interpretation and JIT time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc, and this is added to the byte fetched. bytecodeSetOffset must be set on activating a method and returning from one. It is essentially the same idea as maintaining BCTableBase as a variable.

> What I proposed is enough to let different threads have different
> bytecode sets, at least. Changing bytecode sets on send and return would
> mean explicitly changing BCTableBase in the prolog/epilog sequences of
> each method, and that might cause some thrashing in the instruction
> cache.

It might be worth the cost :-). And lookahead logic might reduce the cost. But I have no real idea :-)
Eliot,

> Right, exactly. We want to know what is cheap on current hardware
> and what is expensive, and what could be made cheaper.

The last one is not easy. That is my complaint about Urs Hölzle's 1995 ECOOP paper, where he concluded that the special hardware in SPARC didn't help Self. Part of his results were due to how costly traps were on SPARC V8 and Solaris (more than 1000 clock cycles), so using them for handling tags and register window overflow/underflow wasn't a good idea. He compared the code generated by his Self compiler with that generated by C and didn't see any difference (hardly surprising, since SPARC was optimized for C and his compiler was optimized for SPARC). The problem is that he couldn't measure the effect that any hardware features SPARC didn't include would have had if they were added. He could make his compiler use register windows or not, and reported less than 1% difference. But he couldn't guess what would happen if a PIC instruction like I proposed were added.

http://www.cs.ucsb.edu/~urs/oocsb/papers/ecoop95-arch.pdf

> Right. The Cog JIT generates abstract instructions in an assembly-like
> style. We can add state to the abstract instructions easily. So if we
> modified some generation routines to set a "context" flag, such as
> "doing allocation", "doing dispatch", "doing store check", "doing
> marshalling", etc, we could label each instruction in the sequences we're
> interested in with that flag.

Great idea! I had not considered looking at the Smalltalk code executing while generating an instruction instead of looking at the generated instructions. And I was thinking of Slang to C, which matters for the Interpreter but not for code generated by Cog.

> Then, for example, when we generate we could reduce that flag to a
> set of bit vectors, one bit per byte of instruction space, one bit vector
> per "interesting code type", and one bit vector that ors them together.

If you have fewer than 8 interesting categories you could have one label byte per instruction byte. That would be easy but wasteful, since only the bits in the byte corresponding to the start of an instruction would matter.

> Then on simulating each instruction we can test its address in the bit
> vectors and find out what kind it is. It will slow down simulation, but
> we're happy to pay the premium fees when we're gathering statistics.

You could increment counters corresponding to each of the bits.

> > About instruction counts, they are certainly very important even if less
> > helpful than cycle counts.
>
> Useful enough data for not too much effort.

For my own project I want to focus on very simple pipelined implementations, so instruction counts would be good enough. But for the J Extension working group we would need to know how the proposals would affect a more advanced implementation like BOOM (the Berkeley Out of Order Machine).

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html

> > I have not yet looked at how multiple bytecode sets are handled in Cog.
> Claus's scheme is to maintain a single variable (at interpretation and JIT
> time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc,
> and this is added to the byte fetched. bytecodeSetOffset must be set
> on activating a method and returning from one. It is essentially the
> same idea as maintaining BCTableBase as a variable.

It would be trivial to convert one to the other (just add or subtract a constant). In fact, if you arrange it so that the table is at address zero in code space, then they are the same.

-- Jecel
On Wed, Mar 28, 2018 at 06:25:29PM -0300, Jecel Assumpcao Jr. wrote:
>
> It would be nice to have data to guide the design of these extensions
> before all this work, but the fact that critical operations are done via
> macros in the OpenSmalltalk VM (as they should be!) instead of subroutines
> makes it hard to know how much time is spent on PICs or allocating new
> objects, for example.

The assumption that cpp macros are faster is one of those things that we all believe to be true because everyone says it is so. It ain't necessarily so.

The memory access macros used for the lowest level mapping of object memory to process memory space are defined in platforms/Cross/vm/sqMemoryAccess.h. These macros are supposedly essential for performance because they affect the efficiency of memory access operations at the lowest levels of object memory functions.

Wrong. If you replace the cpp macros with high level Smalltalk (slang), and rely on the inliner in CCodeGenerator to unroll the Smalltalk into inlined C code, then the performance of the resulting interpreter is essentially the same as that of the interpreter implemented with cpp macros.

This is important for two reasons:

1) By getting rid of the cpp macros, you open up the possibility of doing profiling and debugging of low level functions directly with generated C code that is not obscured by cpp macros.

2) If you want to look at time spent in individual functions that normally would be inlined (either because they are cpp macros, or because the slang inliner unrolled them in similar ways), then you can do so by disabling the inlining during C code generation. This produces a very slow VM, but one that can be easily profiled to see time spent in the individual functions.

I have only ever measured this in the interpreter VM (see package MemoryAccess in the VMMaker repository), but it would be reasonable to expect similar results with oscog, at least with respect to the base interpreter and primitive functions.

Dave
David,

> If you replace the cpp macros with high level Smalltalk (slang), and rely
> on the inliner in CCodeGenerator to unroll the Smalltalk into inlined C code,
> then the performance of the resulting interpreter is essentially the same
> as that of the interpreter implemented with cpp macros.
>
> This is important for two reasons:
>
> 1) By getting rid of the cpp macros, you open up the possibility of doing
> profiling and debugging of low level functions directly with generated C code
> that is not obscured by cpp macros.

I agree it is better, but in terms of measurement, inlined functions and macros are equivalent. But:

> 2) If you want to look at time spent in individual functions that normally
> would be inlined (either because they are cpp macros, or because the
> slang inliner unrolled them in similar ways), then you can do so by
> disabling the inlining during C code generation. This produces a very
> slow VM, but one that can be easily profiled to see time spent in the
> individual functions.

Such results are way better than nothing, but I think they will be distorted. When a function is inlined you can do additional optimization, so the resulting code is not simply the out-of-line function body minus the call/return overhead.

> I have only ever measured this in the interpreter VM (see package
> MemoryAccess in the VMMaker repository), but it would be reasonable to
> expect similar results with oscog, at least with respect to the base
> interpreter and primitive functions.

Like I said, I am interested both in the interpreter and in compiled code. Thanks for the tip!

-- Jecel