SiliconSqueak and RISC-V J Extension

SiliconSqueak and RISC-V J Extension

Jecel Assumpcao Jr
 
Some of you know that I have been working on the design of the
SiliconSqueak processor which is optimized for the OpenSmalltalk VM. I
am currently redesigning it to be a proper extension to the RISC-V
standard instruction set (https://riscv.org/). While the result will not
be as technically elegant as previous versions, it should be much more
interesting commercially.

RISC-V was started at Berkeley in 2010 and has a growing community which
I expect will soon make it the third most important instruction set after
x86 and ARM. It was created to be extensible, so you start
with one of the four integer instruction subsets (RV32I, RV64I, RV128I
or RV32E with only 16 registers for embedded systems) and then you can
optionally have some of the standard extensions:

http://linuxgizmos.com/files/riscv_ref_card1.jpg
http://linuxgizmos.com/files/riscv_ref_card2.jpg

Integer Multiplication and Division (M)
Atomics (A)
Single-Precision Floating-Point (F)
Double-Precision Floating-Point (D)

The IMAFD combination is considered to be popular enough that you can
use G (for General) instead. So RV32G is the same thing as RV32IMAFD.
Some more standard extensions are in the process of being defined:

Quad-Precision Floating-Point (Q)
Decimal Floating-Point (L)
16-bit Compressed Instructions (C)
Bit Manipulation (B)
Dynamic Languages (J)
Transactional Memory (T)
Packed-SIMD Extensions (P)
Vector Extensions (V)
User-Level Interrupts (N)

Non-standard extensions are also allowed. For example, Xhwacha means the
processor has the Hwacha vector extension, which is different from the V
vector extension that is being defined. So a processor named RV32GXsisq
would be a general 32-bit RISC-V with SIliconSQueak extensions (which I
will describe below).

I am in the process of becoming a RISC-V Foundation member so I can
join the J Extension work group, since I feel I can help even though the
"J" is meant to imply Java and JavaScript. It would also be a good idea
to avoid any needless incompatibility between Xsisq and J.

My goal is both to improve the performance and efficiency of the
bytecode interpreter and to make the processor a better target for
adaptive compilation. The ARM folks also did this with Jazelle 1 (later
renamed DBX - direct bytecode execution) and Jazelle 2 (renamed RCT -
runtime compilation target and later ThumbEE). I want to modify Cog to
work both with RV32G and with my extensions, and to have the simulator
interface with a cycle-accurate processor simulator that can let us
measure the effects of the extensions for both simple implementations
and fancier out-of-order execution engines.

It would be nice to have data to guide the design of these extensions
before all this work, but the fact that critical operations are done via
macros in the OpenSmalltalk VM (as they should!) instead of subroutines
makes it hard to know how much time is spent on PICs or allocating new
objects, for example.

== Xsisqbytecodes

This is a special execution mode which uses two extra registers: IP,
pointing to the next bytecode, and BCTableBase, which points to an
aligned 1024 byte table in memory. The instruction at
BCTableBase + ((* (unsigned char *) IP++) << 4) is executed. If the
instruction is some kind of branch then bytecode mode is exited;
otherwise the next bytecode is fetched. Combined with the next
extension, the most common bytecodes can be interpreted by a single
RISC-V instruction.
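For concreteness, here is a minimal C model of that dispatch step,
assuming 16-byte table entries as the << 4 implies (with one 4-byte
instruction per entry the shift would be << 2 and the table exactly
1024 bytes); executeAt is a hypothetical stand-in for the hardware
executing at the computed address:

#include <stdint.h>

extern void executeAt(uintptr_t entry);  /* hypothetical */

extern uint8_t  *ip;          /* IP: points to the next bytecode */
extern uintptr_t bcTableBase; /* BCTableBase: aligned dispatch table */

/* One step of bytecode mode: fetch a bytecode, dispatch into the
   table. The hardware repeats this until a dispatched instruction
   branches, which exits bytecode mode. */
void bytecodeStep(void)
{
    executeAt(bcTableBase + ((uintptr_t)(*ip++) << 4));
}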

== Xsisqstack

Registers x16 to x31 are remapped to a moving window (not quite like
RISC I, RISC II, RISC III (SOAR - Smalltalk On A RISC) or RISC IV
(Spur)) but in the same spirit. In addition, register x15 becomes an
alias for one of the others so you can push to and pop from the stack
implicitly instead of using extra instructions. When combined with the
next extension, each register and word on the stack gets a 33rd (or 65th
or 129th) bit that can't be changed by the software and which
distinguishes between raw values and tagged values.

== Xsisqobjectmemory

Normal RISC-V load and store instructions generate a virtual address by
adding a 12 bit immediate value to a register value. This virtual
address is translated to a physical address by the MMU and that is used
to access the cache. In the object addressing mode the value in the
register is considered a virtual object ID and the immediate value is
an offset. They are not added but used separately by the cache hash to
access the desired word. On a cache miss an object table is used to find
the physical address.
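A rough software model of that lookup might look as follows; the hash,
the line geometry and the objectTableFill miss handler are all
assumptions for illustration:

#include <stdint.h>

enum { CacheLines = 1024, WordsPerLine = 8 };

typedef struct {
    uint32_t oid;     /* which object this line caches */
    uint32_t lineNo;  /* which line of that object */
    int      valid;
    uint32_t words[WordsPerLine];
} CacheLine;

static CacheLine cache[CacheLines];

/* hypothetical miss handler: consults the object table for the
   physical address and refills the line */
extern uint32_t objectTableFill(CacheLine *l, uint32_t oid, uint32_t offset);

uint32_t objectLoad(uint32_t oid, uint32_t offset)
{
    uint32_t lineNo = offset / WordsPerLine;
    /* OID and offset index the cache together but are never added */
    CacheLine *l = &cache[(oid ^ lineNo) % CacheLines];
    if (!l->valid || l->oid != oid || l->lineNo != lineNo)
        return objectTableFill(l, oid, offset);   /* cache miss */
    return l->words[offset % WordsPerLine];
}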

The object formats are known so loads and stores can tell raw values
from tagged words. Both V3 and Spur images use direct pointers and so
can't take advantage of this mode, though the RoarVM could probably be
adapted since it uses object tables.

== Xsisqpic

A special PIC execution mode is entered with an instruction that reads
an object's class. The instruction cache loads a line hashed by both the
class and the current PC and the instructions there are executed until
some branch happens. So it is like a combination of call and switch.
This means that PICs take the same time no matter how many entries they
have. If all levels of caches miss, the compiler has to be called to
create a new PIC entry for this PC/class combination.
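A sketch of that lookup in C, with a made-up hash and an assumed
compilePICCase fallback standing in for calling the compiler:

#include <stdint.h>

enum { ICacheLines = 512 };

typedef struct {
    uint32_t classOid;    /* receiver class the entry was made for */
    uint32_t callSitePC;  /* the send site the entry belongs to */
    int      valid;
    uint32_t *code;       /* executed until some branch exits PIC mode */
} PICLine;

static PICLine icache[ICacheLines];

/* hypothetical fallback: have the compiler create a new PIC entry
   for this PC/class combination */
extern uint32_t *compilePICCase(uint32_t classOid, uint32_t callSitePC);

uint32_t *picLookup(uint32_t classOid, uint32_t callSitePC)
{
    PICLine *l = &icache[(classOid * 31u ^ callSitePC) % ICacheLines];
    if (l->valid && l->classOid == classOid && l->callSitePC == callSitePC)
        return l->code;   /* hit: constant time however many cases exist */
    return compilePICCase(classOid, callSitePC);
}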

Besides these four extensions, there is an Xteam extension which allows a
group of cores to work together on a single piece of code. Using this
will require adding a new compiler to Cog, so I won't go into it here.

There are interesting possible extensions which I have not worked on:

- support for GC (though Xsisqobjectmemory does help), like read or write
barriers
- support for execution counters
- support for JIT, like the hardware accelerators in this RISC-V to VLIW
runtime compiler:
> https://riscv.org/wp-content/uploads/2017/05/Wed1545-HybridDBT-Rokicki.pdf

Any other ideas?

-- Jecel

Re: SiliconSqueak and RISC-V J Extension

Eliot Miranda-2
 
Hi Jecel,

On Wed, Mar 28, 2018 at 2:25 PM, Jecel Assumpcao Jr. <[hidden email]> wrote:

> [SiliconSqueak as a RISC-V extension: standard and non-standard
> extensions, and the goal of improving the bytecode interpreter and
> adaptive compilation]
>
> It would be nice to have data to guide the design of these extensions
> before all this work, but the fact that critical operations are done via
> macros in the OpenSmalltalk VM (as they should!) instead of subroutines
> makes it hard to know how much time is spent on PICs or allocating new
> objects, for example.

One thing that should be straightforward to do is to modify the simulator to collect instructions and attribute them to specific tasks.  This wouldn't give you cycle counts for each kind of operation, but one could augment the instruction counts with max and min cycle counts for a range quite easily (though getting that cycle data is tedious without a convenient online form).  I don't think that Bochs has any support for accurate cycle counts, or page faults etc.  But I think instruction counts would be very interesting.  Does this sound worthwhile to you?
 

> == Xsisqbytecodes
> [a special execution mode: the instruction at
> BCTableBase + ((* (unsigned char *) IP++) << 4) is executed; the most
> common bytecodes can be interpreted by a single RISC-V instruction]

Ah, that's nice.  So BCTableBase can be used to implement multiple bytecode sets, right?  What support is there for setting BCTableBase, e.g. on return?

> == Xsisqstack
> [registers x16 to x31 become a moving window; x15 aliases one of them
> for implicit push and pop; each register and stack word gets an extra
> bit distinguishing raw values from tagged values]

Some examples would make this clearer.  I don't understand the extra tag bit.  I find a single tag bit restrictive.  Is the scheme more general than I'm presuming?

> == Xsisqobjectmemory
> [loads and stores treat the register value as a virtual object ID and
> the immediate as an offset, used separately by the cache hash; on a
> miss an object table gives the physical address]
>
> == Xsisqpic
> [a PIC execution mode where instruction cache lines are hashed by both
> class and PC, so PICs take the same time no matter how many entries
> they have]
>
> Besides these four extensions, there is an Xteam extension which allows
> a group of cores to work together on a single piece of code. Using this
> will require adding a new compiler to Cog, so I won't go into it here.

I wonder if this will fit with Clément's new PhD student's work on extending Cog for vector instructions and with Ronie's Lowcode work.  We also want to extend the compiler, in this case to include vector support.
 

> [possible extensions not yet worked on: GC read/write barriers,
> execution counters, and JIT support like the HybridDBT RISC-V to VLIW
> runtime compiler]
>
> Any other ideas?

My ideas are probably too naive, and too specific to the current Cog VM.  But they're simple things like:

- having a conditional move which doesn't raise an exception if it doesn't move data, to make writing the class test fast.  Unfortunately the x86/x86_64 conditional move raises an exception if given an invalid address even when the condition is false, so one can't use it to do the "if the value is untagged, fetch the class" operation without a jump: when the value is tagged, even though no data is moved, an exception is raised (a sketch of this follows below).

- having a status register, or status bits, or bits in the tagged arithmetic
instruction itself, which define what is the valid tag pattern for tagged
arithmetic instructions would be nice.  VW on SPARC never used the tagged
arithmetic instructions because they mandated 00 as the tag pattern for
tagged integers.  Sad.  Having a non-trapping tagged arithmetic instruction
(add tagged and skip if untagged) would be nice (because traps are
expensive).

- putting the immediate floating point encode/decode sequences in hardware
would be nice.
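To make the first point concrete, here is the class test in C terms,
with an illustrative tag scheme and header offset (not Cog's actual
layout):

#include <stdint.h>

typedef intptr_t sqInt;

enum { ClassFieldOffset = 8 };   /* illustrative header layout */

sqInt fetchClassOf(sqInt oop, sqInt smallIntegerClass)
{
    if (oop & 1)                                /* tagged immediate */
        return smallIntegerClass;
    return *(sqInt *)(oop + ClassFieldOffset);  /* pointer: load class */
}

/* A branch-free version needs a conditional move that skips the memory
   read when the condition is false. x86/x86_64 cmov reads its source
   operand unconditionally, so for a tagged oop such as 3 the load from
   address 3+8 faults even though nothing would be moved; hence the
   jump. */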

But I'm intrigued by your statistics gathering suggestion.  It would be great to have finer grained stats.  Although I wonder how valid those stats would be in a processor with extensive support for OO.

_,,,^..^,,,_
best, Eliot

Re: SiliconSqueak and RISC-V J Extension

Jecel Assumpcao Jr
 
Eliot,

thank you for your comments.

> > [macros and gathering data]
>
> One thing that should be straight-forward to do is to modify the
> simulator to collect instructions and attribute them to specific tasks.

You mean individual instructions or short sequences of instructions?
Perhaps we are talking about different things. What I meant was that,
watching Clément's video about fixing a bug in the VM, I noticed that the
code for #new had been inlined into the method he was looking at. That
same short sequence probably shows up in a lot of places. I want to know
what percent of the time is spent in all such sequences added together.
If it turns out to be 0.03% of the time then special hardware to make
#new faster would be a bad idea.

Most of the data in the green book was easy to get because something
like #new would be a subroutine and normal profiling tools can tell you
how much time is spent in a given subroutine.

> This wouldn't give you cycle counts for each kind of operation but
> one could augment the instruction counts with max and min cycle
> counts for a range quite easily (but getting that cycle data is tedious
> without a convenient online form).  I don't think that Bochs has any
> support for accurate cycle counts, or page faults etc.  But I think
> instruction counts would be very interesting.  Ds this sound worthwhile
> to you?

The simulators I have written so far also assume infinite caches and so
on. I am going to fix this on my next one. The x86 is pretty complicated
(which is why you adopted Bochs in the first place, right?) but an ARM
or RISC-V simulator in Squeak that we can more easily instrument should
be doable. But that is at the user level - page faults and such require
simulating the supervisor level and an OS.

About instruction counts, they are certainly very important even if less
helpful than cycle counts.
 
> > [BCTableBase + (* (char *) IP++) << 4]
>
> Ah, that's nice.  So BCTableBase can be used to implement multiple
> bytecode sets right?  What support is there for setting BCTableBase,
> e.g. on return?

I have not yet looked at how multiple bytecode sets are handled in Cog.
What I proposed is enough to let different threads have different
bytecode sets, at least. Changing bytecode sets on send and return would
mean explicitly changing BCTableBase in the prolog/epilog sequences of
each method and that might cause some thrashing in the instruction
cache.

> > [extra bit in registers and stack]
>
> Some examples would make this clearer.  I don't understand the extra
> tag bit.  I find a single tag bit restrictive.  Is the scheme more general
> than I'm presuming?

This makes it a three level scheme:

<32 bits> 0 = raw values (like in a ByteArray or Bitmap)
<32 bits> 1 = see tag in low bits:
    <31 bits> 1 1 = SmallInteger
    <30 bits> 10 1 = Character
    <30 bits> 00 1 = oop - see object header for more details

The 33rd bit shown above doesn't exist in memory or cache, but is
created by load instructions (which must know which addresses have
tagged values and which don't, which is pretty complicated in the normal
Squeak image formats) and is checked by the store instructions. Each
integer register has an associated 33rd bit and that is saved and
restored as it is spilled to the stack.

When using this mode you can convert a SmallInteger to a 32 bit raw word
and the other way around (if it was actually a 31 bit value) but you
can't manipulate an oop in any way at the user level (you can trap to
the supervisor level to have that do it for you).
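As a software model, each register can be pictured as 32 data bits plus
the hidden bit (the names here are illustrative):

#include <stdint.h>
#include <stdbool.h>

/* an Xsisqstack register: 32 data bits plus the hidden 33rd bit,
   which loads set and user code can't change */
typedef struct {
    uint32_t bits;
    bool     tagged;   /* the hidden 33rd bit */
} XReg;

const char *classify(XReg r)
{
    if (!r.tagged)         return "raw value";     /* e.g. ByteArray word */
    if (r.bits & 1)        return "SmallInteger";  /* low bit 1 */
    if ((r.bits & 3) == 2) return "Character";     /* low bits 10 */
    return "oop";                                  /* low bits 00 */
}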

Note that there is a tagged RISC-V implementation and a second one that
is migrating from MIPS64 to RISC-V and both store the extra bits for
each word in memory:

http://www.lowrisc.org/docs/tagged-memory-v0.1/tags/
https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

Both of these solutions are trying to improve the security of C code.
Things are simpler for us. Note that this option wouldn't be used for V3
or Spur - a secure VM needs this plus a few other things (like checking
that an image hasn't been changed by some external program) to make any
sense (so it would have to be signed or something like that).
 
> > Besides these four extensions, there is a Xteam extension which allows a
> > group of cores to work together on a single piece of code. Using this
> > will require adding a new compiler to Cog, so I won't go into it here.
>
> I wonder if this will fit with Clément's new PhD student's work on extending
> Cog for vector instructions and with Ronie's Lowcode work.  We also want
> to extend the compiler, in this case to include vector support. 

It would probably work great! But the standard V extension is pretty
good too and might be a better option. Let me go into the history of
this:

One thing I wanted to do when SiliconSqueak started in 2009 was to take
advantage of the fact it was going to be implemented in an FPGA to add
an extra level of compilation so that the most critical loops would get
converted to hardware. I had helped a friend with his PhD in the 1990s
that was a step in this direction (his project was called SelfHDL).
Unfortunately the secretive nature of FPGA tools made this very hard to
do. In 2010 I came up with the idea of an ALU Matrix coprocessor with,
for example, 8 by 8 ALUs of 8 bits each that had a nice way to exchange
data with their neighbors. So the extra compiler could target that
instead of trying to generate hardware and I could still take advantage
of the FPGA by changing from few cores with coprocessors to more cores
without depending on the needs of the compiled code.

> https://www.researchgate.net/publication/312029084_2014_poster_reconfiguration

A big problem was that the coprocessor had a lot of internal state and
also a big program memory which made sharing it between different tasks
very costly (this was the worst problem of the Intel i860). In 2016
(after my PhD deadline had passed anyway) I worked on this by making the
ALU matrix an independent processor with an instruction cache instead of
manually loaded instruction memory (even though the Cell processors,
which were similar, turned out to be very hard to program). In 2017 I
worked on vector registers very much like the RISC-V V Extension. Some
of the code that the ALU Matrix could handle was not very regular and
the vectors don't help in those cases.

So at the end of 2017 I came up with a way to make several simple cores
work together when needed but go their own way otherwise. This would
work with fixed hardware instead of changing things in an FPGA. Not as
interesting for a PhD but better for a commercial product. The idea is
that you can map some registers to channels like in the old Transputer.
So you might say that register x12 in core 7 goes to register x10 in
core 11. Trying to read from a channel that hasn't been written to yet
will block as will trying to write a second time to a channel which
hasn't read the previous value.
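A one-slot channel with exactly those blocking rules can be modelled in
C with pthreads (purely illustrative; the hardware would track this in
the register scoreboard rather than with locks):

#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint32_t        value;
    int             full;
} Channel;

void channelWrite(Channel *c, uint32_t v)
{
    pthread_mutex_lock(&c->lock);
    while (c->full)                  /* block: previous value unread */
        pthread_cond_wait(&c->changed, &c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
}

uint32_t channelRead(Channel *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->full)                 /* block: nothing written yet */
        pthread_cond_wait(&c->changed, &c->lock);
    uint32_t v = c->value;
    c->full = 0;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
    return v;
}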

Running the same code on all cores will be a good approximation of a
vector system, but more costly as each core has its own control logic,
PC and so on. But you might compile a basic block to run on a team of 6
cores, for example, and have each core do a different thing. This is
the software equivalent of an out-of-order single core (since different
cores can move forward at different rates except when blocked by the
channels). Here I am thinking not only of an advanced Cog but what would
be good for Qemu or MAME. This is similar to the old RAW project from
MIT.

> My ideas are probably too naive, and too specific to the current Cog VM.  But
> they're simple things like having a conditional move which doesn't raise an
> exception if it doesn't move data, making the class test fast.  Unfortunately
> the x86/x86_64 conditional move raises an exception if given an invalid address
> when the condition is false, so one can't use it to do the "if the value is untagged,
> fetch the class" operation without a jump, because when the value is tagged, even
> though no data is moved, an exception is raised.

I had no idea. The description of the instruction says it loads the
source into a temporary register and then if the condition is true it
stores the temporary register in the destination. So the trap happens
before the condition is even looked at.

Note that your use of "untagged" and mine above are very different. But
I got what you meant (oop).

> - having a status register, or status bits, or bits in the tagged arithmetic instruction
> itself, which define what is the valid tag pattern for tagged arithmetic instructions
> would be nice.  VW on SPARC never used the tagged arithmetic instructions because
> they mandated 00 as the tag pattern for tagged integers.  Sad.  Having a non-trapping
> tagged arithmetic instruction (add tagged and skip if untagged) would be nice
> (because traps are expensive).

In the first few versions of SiliconSqueak I had configurable tag
hardware. This is the description of the tagConfig register:

# Tag configuration defines the operation of the two detagging units
# (associated with operands A and B) and the retagging unit (associated
# with the destination). The lowest 16 bits of the register indicate
# valid SmallInteger combinations of d31, d30, d1 and d0. The next
# higher 4 bits are ANDed to d31, d30, d1 and d0 when converting from
# tagged to untagged SmallIntegers while 4 other bits are ORed to d31,
# d30, d1 and d0 when converting from untagged to tagged SmallIntegers.
# 2 bits indicate how much to shift right when converting from tagged
# to untagged SmallIntegers and the same bits indicate how much to
# shift left for the reverse operation. The top 6 bits are undefined.
#
# For Squeak the bottom bit is set to 1 for SmallIntegers, so this
# register must be set to hex value 011EAAAA. The AAAA defines all odd
# values as valid SmallIntegers. The E will clear d0 when converting to
# raw bits and the bottom 1 will set it when retagging. The top 1 will
# divide the tagged value by 2 and multiply it back when retagging.
# For Self the bottom two bits are 0 for SmallIntegers, so this
# register must be set to hex value 020F1111. An option that works well
# in hardware but is complicated to deal with in software is when the
# top two bits must match in SmallIntegers. This can be handled by
# setting this register to hex value 000FF00F.
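Here is one possible reading of that layout as a C model of a detagging
unit; treating the shift as arithmetic so negative SmallIntegers keep
their sign is my assumption:

#include <stdint.h>

typedef struct { uint32_t value; int isSmallInt; } Detagged;

Detagged detag(uint32_t v, uint32_t tagConfig)
{
    /* form the 4-bit pattern from bits d31, d30, d1 and d0 */
    unsigned p = (((v >> 31) & 1) << 3) | (((v >> 30) & 1) << 2)
               | (((v >> 1) & 1) << 1) | (v & 1);
    Detagged r = { 0, (int)((tagConfig >> p) & 1) };  /* low 16 bits */
    if (!r.isSmallInt)
        return r;               /* not a valid SmallInteger pattern */
    unsigned and4  = (tagConfig >> 16) & 0xF;  /* ANDed to d31,d30,d1,d0 */
    unsigned shift = (tagConfig >> 24) & 3;    /* right shift amount */
    uint32_t keep = 0xFFFFFFFFu;
    if (!(and4 & 8)) keep &= ~(1u << 31);
    if (!(and4 & 4)) keep &= ~(1u << 30);
    if (!(and4 & 2)) keep &= ~(1u << 1);
    if (!(and4 & 1)) keep &= ~1u;
    r.value = (uint32_t)((int32_t)(v & keep) >> shift);
    return r;
}

/* With the Squeak value 011EAAAA: AAAA accepts every odd word, the E
   clears d0, and the shift of 1 halves, so a tagged 2n+1 becomes n. */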

For the 2016 version I decided this was a costly complication and just
made it always use the last configuration (where the top two bits 00 or
11 indicate a SmallInteger). In software this is very complicated to do
but in hardware it is just a single XOR gate. This meant converting
when loading or saving V3 or Spur images, of course.

> Putting the immediate floating point encode/decode sequences in hardware would be nice.

With the D and F extensions you already have to let 32 and 64 bit floats
share the same registers. So it wouldn't add much to let tagged floats
be an option. One of the proposals is the L extension which would allow
decimal floats. That is a lot more complicated.

> But I'm intrigued by your statistics gathering suggestion.  It would be great to have
> finer grained stats.  Although I wonder how valid those stats would be in a processor
> with extensive support for OO.

Hardly my idea - one of the RISC-V creators wrote a book about using a
Quantitative Approach to designing processors, after all. In fact, the
various papers and theses published for RISC-III (SOAR) are a great
example of what we need to do. And they showed how following your
intuition can have bad results. For example: all their instructions were
tagged but in the end only the tagged ADD made the slightest difference.
They also automatically filled registers with Nil but it didn't help.

They didn't have the statistics gathering I want, however. What they did
was to turn features on and off in their compiler and look at total
execution time for a bunch of benchmarks.

In my case it would be like running the benchmarks where we take
advantage of using x15 to push/pop stuff and then run them again but
with explicit stack pointer manipulation sequences. That just gives you
an overall number.

-- Jecel

Re: SiliconSqueak and RISC-V J Extension

Eliot Miranda-2
 
Hi Jecel,

> On Mar 28, 2018, at 5:09 PM, Jecel Assumpcao Jr. <[hidden email]> wrote:
>
> You mean individual instructions or short sequences of instructions?
> [...] I want to know
> what percent of the time is spent in all such sequences added together.

Exactly.

> If it turns out to be 0.03% of the time then special hardware to make
> #new faster would be a bad idea.

Right, exactly.  We want to know what is cheap on current hardware and what is expensive, and what could be made cheaper.

> Most of the data in the green book was easy to get because something
> like #new would be a subroutine and normal profiling tools can tell you
> how much time is spent in a given subroutine.

Right.  The Cog JIT generates abstract instructions in an assembly-like style.  We can add state to the abstract instructions easily.  So if we modified some generation routines to set a "context" flag, such as "doing allocation", "doing dispatch", "doing store check", "doing marshalling", etc, we could label each instruction in the sequences we're interested in with that flag.  Then, for example, when we generate code we could reduce that flag to a set of bit vectors, one bit per byte of instruction space, one bit vector per "interesting code type", and one bit vector that ors them together.  Then on simulating each instruction we can test its address in the bit vectors and find out what kind it is.  It will slow down simulation, but we're happy to pay the premium fees when we're gathering statistics.
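A minimal C sketch of that bookkeeping, with invented sizes and
category names:

#include <stdint.h>

enum { CodeBytes = 1 << 20 };   /* size of the JIT's code space */
enum { DoingAllocation, DoingDispatch, DoingStoreCheck,
       DoingMarshalling, NumKinds };

static uint8_t  kindBits[NumKinds][CodeBytes / 8]; /* 1 bit per byte */
static uint8_t  anyKind[CodeBytes / 8];            /* or of the above */
static uint64_t kindCounts[NumKinds];

/* JIT side: label len bytes emitted at addr under the current flag */
void labelRange(uint32_t addr, uint32_t len, int kind)
{
    for (uint32_t a = addr; a < addr + len; a++) {
        kindBits[kind][a >> 3] |= (uint8_t)(1 << (a & 7));
        anyKind[a >> 3]        |= (uint8_t)(1 << (a & 7));
    }
}

/* simulator side: test each executed instruction's address */
void countInstruction(uint32_t pc)
{
    if (!(anyKind[pc >> 3] & (1 << (pc & 7))))
        return;                    /* uninteresting code: one test */
    for (int k = 0; k < NumKinds; k++)
        if (kindBits[k][pc >> 3] & (1 << (pc & 7)))
            kindCounts[k]++;       /* per-category instruction counts */
}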

>
>> [instruction counts vs cycle counts]
>
> The simulators I have written so far also assume infinite caches and so
> on. I am going to fix this on my next one. The x86 is pretty complicated
> (which is why you adopted Bochs in the first place, right?

Right.  I had considered trying to generate a simulator given a semantics (e.g. as in the Intel manual) but Josh Gargus made me see sense and pointed me at Bochs.

> ) but an ARM
> or RISC-V simulator in Squeak that we can more easily instrument should
> be doable. But that is at the user level - page faults and such require
> simulating the supervisor level and an OS.
>
> About instruction counts, they are certainly very important even if less
> helpful than cycle counts.

Useful enough data for not too much effort.  Even quite inaccurate cycle counts would be a lot more work and a much harder thing to prove correct.  In any case they depend on the processor implementation, and given that those implementations evolve quickly I have never thought it worthwhile to do micro measurements to find out what are the fast instruction sequences on specific versions.  People do do this and get great results.  I've simply never worked in a situation where I felt I could afford the effort.

>
>>> [BCTableBase + ((* (unsigned char *) IP++) << 4)]
>>
>> What support is there for setting BCTableBase, e.g. on return?
>
> I have not yet looked at how multiple bytecode sets are handled in Cog.

Claus's scheme is to maintain a single variable (at interpretation and JIT time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc, and this is added to the byte fetched.  bytecodeSetOffset must be set on activating a method and returning from one.  It is essentially the same idea as maintaining BCTableBase as a variable.

> What I proposed is enough to let different threads have different
> bytecode sets, at least. Changing bytecode sets on send and return would
> mean explicitly changing BCTableBase in the prolog/epilog sequences of
> each method and that might cause some thrashing in the instruction
> cache.

It might be worth the cost :-). And lookahead logic might reduce the cost.  But I have no real idea :-)

> [...]

measurements and multiple bytecode sets (was: SiliconSqueak and RISC-V J Extension)

Jecel Assumpcao Jr
 
Eliot,

> Right, exactly.  We want to know what is cheap on current hardware
> and what is expensive, and what could be made cheaper.

The last one is not easy. That is my complaint about Urs Hölzle's 1995
ECOOP paper, where he concluded that the special hardware in Sparc didn't
help Self. Part of his results was due to how costly traps were on
Sparc 8 and Solaris (more than 1000 clock cycles), so using them for
handling tags and register window overflow/underflow wasn't a good
idea. He compared the code generated by his Self compiler with that
generated by C and didn't see any difference (hardly surprising, since
Sparc was optimized for C and his compiler was optimized for Sparc). The
problem is that he couldn't measure the effect that any hardware
features Sparc didn't include would have had if they were added. He could
make his compiler use or not use register windows and report less than 1%
difference. But he couldn't guess what would happen if a PIC instruction
like the one I proposed were added.

http://www.cs.ucsb.edu/~urs/oocsb/papers/ecoop95-arch.pdf

> Right.  The Cog JIT generates abstract instructions in an assembly-like
> style.  We can add state to the abstract instructions easily.  So if we
> modified some generation routines to set a "context" flag, such as
> "doing allocation", "doing dispatch", "doing store check", "doing
> marshalling", etc, we could label each instruction in the sequences we're
> interested in with that flag.

Great idea! I had not considered looking at the Smalltalk code executing
while generating an instruction instead of looking at the generated
instructions. And I was thinking of Slang to C, which matters for the
Interpreter but not for code generated by Cog.

> Then, for example, when we generate code we could reduce that flag to a
> set of bit vectors, one bit per byte of instruction space, one bit vector
> per "interesting code type", and one bit vector that ors them together.

If you have less than 8 interesting categories you could have one label
byte per one instruction byte. That would be easy but wasteful since
only the bits in the byte corresponding to the start of an instruction
would matter.

>  Then on simulating each instruction we can test its address in the bit
> vectors and find out what kind it is.  It will slow down simulation, but
> we're happy to pay the premium fees when we're gathering statistics.

You could increment counters corresponding to each of the bits.

> > About instruction counts, they are certainly very important even if less
> > helpful than cycle counts.
>
> Useful enough data for not too much effort.  Even quite inaccurate cycle
> counts would be a lot more work and a much harder thing to prove correct.
> In any case they depend on the processor implementation, and given that
> those implementations evolve quickly I have never thought it worthwhile
> to do micro measurements to find out what are the fast instruction
> sequences on specific versions.  People do do this and get great results.
> I've simply never worked in a situation where I felt I could afford the
> effort.

For my own project I want to focus on very simple pipelined
implementations, so instruction counts would be good enough. But for the
J Extension work group we would need to know how the proposals would
affect a more advanced implementation like BOOM (Berkeley Out of Order
Machine).

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html

> > I have not yet looked at how multiple bytecode sets are handled in Cog.
>
> Claus's scheme is to maintain a single variable (at interpretation and JIT
> time) called bytecodeSetOffset, which has values 0, 256, 512, 768 etc,
> and this is added to the byte fetched.  bytecodeSetOffset must be set
> on activating a method and returning from one.  It is essentially the
> same idea as maintaining BCTableBase as a variable.

It would be trivial to convert one to the other (just add or subtract a
constant). In fact, if you arrange it so that the table is at address
zero in code space then they are the same.
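In C terms, assuming the 16-byte entries of Xsisqbytecodes, the two
schemes compute the same address whenever the bases are related by the
scaled offset:

#include <stdint.h>

uintptr_t viaBCTableBase(uintptr_t bcTableBase, uint8_t bytecode)
{
    return bcTableBase + ((uintptr_t)bytecode << 4);
}

uintptr_t viaSetOffset(uintptr_t tableBase, unsigned bytecodeSetOffset,
                       uint8_t bytecode)
{
    /* bytecodeSetOffset is 0, 256, 512, 768... as in Claus's scheme */
    return tableBase + (((uintptr_t)bytecodeSetOffset + bytecode) << 4);
}

/* The two agree exactly when
   bcTableBase == tableBase + (bytecodeSetOffset << 4), and with
   tableBase == 0 converting between them is just that shift. */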

-- Jecel

Re: SiliconSqueak and RISC-V J Extension

David T. Lewis
In reply to this post by Jecel Assumpcao Jr
 
On Wed, Mar 28, 2018 at 06:25:29PM -0300, Jecel Assumpcao Jr. wrote:
>  
>
> It would be nice to have data to guide the design of these extensions
> before all this work, but the fact that critical operations are done via
> macros in the OpenSmalltalk VM (as they should!) instead of subroutines
> makes it hard to know how much time is spent on PICs or allocating new
> objects, for example.
>

The assumption that cpp macros are faster is one of those things that
we all believe to be true because everyone says it is so.

It ain't necessarily so.

The memory access macros used for the lowest level mapping of object memory
to process memory space are defined in platforms/Cross/vm/sqMemoryAccess.h.
These macros are essential for performance because they affect the efficiency
of memory access operations at the lowest levels of object memory functions.

Wrong.

If you replace the cpp macros with high level Smalltalk (Slang), and rely
on the inliner in CCodeGenerator to unroll the Smalltalk into inlined C code,
then the performance of the resulting interpreter is essentially the same
as that of the interpreter implemented with cpp macros.
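For illustration, the two styles being contrasted look roughly like
this (simplified in the spirit of sqMemoryAccess.h, not its exact text):

#include <stdint.h>

typedef intptr_t sqInt;

/* cpp-macro style: textual substitution, invisible to profilers */
#define longAt(addr)         (*((sqInt *)(addr)))
#define longAtput(addr, val) (*((sqInt *)(addr)) = (val))

/* function style: the same access as an ordinary function, which an
   inliner can unroll to identical code, or which can be left
   uninlined so a profiler sees the time spent in it */
static sqInt longAtFn(sqInt addr)
{
    return *(sqInt *)addr;
}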

This is important for two reasons:

1) By getting rid of the cpp macros, you open up the possibility of doing
profiling and debugging of low level functions directly with generated C code
that is not obscured by cpp macros.

2) If you want to look at time spent in individual functions that normally
would be inlined (either because they are cpp macros, or because the
slang inliner unrolled them in similar ways), then you can do so by
disabling the inlining during C code generation. This produces a very
slow VM, but one that can be easily profiled to see time spent in the
individual functions.

I have only ever measured this in the interpreter VM (see package
MemoryAccess in the VMMaker repository), but it would be reasonable to
expect similar results with oscog, at least with respect to the base
interpreter and primitive functions.

Dave
 

macros (was: SiliconSqueak and RISC-V J Extension)

Jecel Assumpcao Jr
 
David,

> If you replace the cpp macros with high level Smalltalk (Slang), and rely
> on the inliner in CCodeGenerator to unroll the Smalltalk into inlined C code,
> then the performance of the resulting interpreter is essentially the same
> as that of the interpreter implemented with cpp macros.
>
> This is important for two reasons:
>
> 1) By getting rid of the cpp macros, you open up the possibility of doing
> profiling and debugging of low level functions directly with generated C code
> that is not obscured by cpp macros.

I agree it is better, but in terms of measurement inlined functions and
macros are equivalent. But:

> 2) If you want to look at time spent in individual functions that normally
> would be inlined (either because they are cpp macros, or because the
> slang inliner unrolled them in similar ways), then you can do so by
> disabling the inlining during C code generation. This produces a very
> slow VM, but one that can be easily profiled to see time spent in the
> individual functions.

Such results are way better than nothing, but I think they will be
distorted. When a function is inlined you can do additional optimization,
so the resulting code is not simply the function body minus the
call/return overhead.

> I have only ever measured this in the interpreter VM (see package
> MemoryAccess in the VMMaker repository), but it would be reasonable to
> expect similar results with oscog, at least with respect to the base
> interpreter and primitive functions.

Like I said, I am interested both in the interpreter and in compiled
code. Thanks for the tip!

-- Jecel