Hi all, RoarVM uses an object table, while the Squeak VM uses direct pointers to objects. This is a basic element of VM design, and I wonder how much impact it has on overall VM speed. Both variants have their own advantages and disadvantages, but I think that with a good JIT an extra indirection could be almost insignificant. And having indirect pointers to objects opens up quite attractive possibilities. Being able to freely choose an object's location means that:
- it is quite easy to implement object memory paging (swapping between memory and disk),
- particular objects can be placed at special memory locations (good for FFI, object pinning etc.),
- #become: is O(1) instead of O(heap size).
The downside of indirect pointers is, of course, higher memory traffic, which directly affects every operation everywhere. What else? I'd like to know what you think about it, and why the Squeak VM, in particular, uses direct object pointers. What is this choice based on? I'd like to know. Maybe I am missing something.
-- Best regards, Igor Stasenko AKA sig.
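[A minimal C sketch of the two schemes under discussion — names and layout are illustrative only, not code from the Squeak or RoarVM sources. With direct pointers an oop is the object's address; with an object table an oop indexes a table slot, so every field access pays one extra load, but #become: reduces to swapping two slots.]

#include <stdint.h>
#include <stddef.h>

typedef uintptr_t oop;

/* Illustrative sketch only -- not actual Squeak/RoarVM code. */

/* Direct pointers: the oop is the object's address. */
static inline uintptr_t fetch_direct(oop obj, size_t field) {
    return ((uintptr_t *)obj)[field];            /* one load */
}

/* Indirect pointers: the oop indexes a table of addresses. */
extern uintptr_t *objectTable[];

static inline uintptr_t fetch_indirect(oop obj, size_t field) {
    return objectTable[obj][field];              /* two loads */
}

/* With a table, #become: is O(1): swap the two slots and every
   reference to a now reaches b's body, and vice versa -- no heap
   scan rewriting pointers. */
static inline void become_swap(oop a, oop b) {
    uintptr_t *tmp = objectTable[a];
    objectTable[a] = objectTable[b];
    objectTable[b] = tmp;
}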
On 11/12/2010 12:05 AM, Igor Stasenko wrote:
> Both variants have their own advantages and disadvantages, but I think
> that with a good JIT an extra indirection could be almost insignificant.

This ignores the cost of memory access.

> I'd like to know what you think about it, and why the Squeak VM, in
> particular, uses direct object pointers.

Performance.

> What is this choice based on? I'd like to know. Maybe I am missing something.

http://ftp.squeak.org/docs/OOPSLA.Squeak.html:

"The Object Memory

The design of an object memory that is general and yet compact is not simple. We all agreed immediately on a number of parameters, though. For efficiency and scalability to large projects, we wanted a 32-bit address space with direct pointers (i.e., a system in which an object reference is just the address of that object in memory). The design had to support all object formats of our existing Smalltalk. It must be amenable to incremental garbage collection and compaction. Finally, it must be able to support the "become" operation (exchange identity of two objects) to the degree required in normal Smalltalk system operation. ... etc ..."

(also see the section on storage management)

And if in doubt, drop a note to dan ingalls at sap dot com and you'll get the answer straight from the source :-)

Cheers,
- Andreas
Sorry, Andreas. Maybe I wasn't clear: I don't want brief answers, I need details. :) I would like to hear your opinion in this context: if you were designing a VM from scratch and had direct access to a highly optimizing compiler/JIT, what would your choice be? I read the Squeak VM design description before. Still, it would be good to know whether the impact of indirect pointers can be measured, and how.
-- Best regards, Igor Stasenko AKA sig.
Here is a simulated kind of indirect pointer access: the Wrapper class has an 'object' ivar, and in this way we simulate the indirection.

| objects wrapped t1 t2 t3 |
objects := (1 to: 1000) collect: [:i | Object new ].
wrapped := objects collect: [:each | Wrapper new object: each ].

t1 := [ 100000 timesRepeat: [ objects do: [:each | each yourself ] ] ] timeToRun.
t2 := [ 100000 timesRepeat: [ wrapped do: [:each | each object yourself ] ] ] timeToRun.
t3 := [ 100000 timesRepeat: [ wrapped do: [:each | ] ] ] timeToRun.
{t1. t2. t3}

Running on Cog it gives:

#(3241 3498 2793)

The first benchmark measures the time to access objects directly, the second measures indirect access, and the third measures the loop overhead.

So, taking this naive benchmark, we get:

(3498 - 2793) / (3241 - 2793) asFloat
1.573660714285714

so, about 57% slower.

But actually this benchmark shows the cost of an extra message send rather than the impact of an extra level of indirection. Well, a message send is a kind of indirection... :)

-- Best regards, Igor Stasenko AKA sig.
Hi Igor:

On 12 Nov 2010, at 10:32, Igor Stasenko wrote:
> I would like to hear your opinion in this context: if you were designing
> a VM from scratch and had direct access to a highly optimizing
> compiler/JIT, what would your choice be?

I don't think you will get a satisfying answer to that question. It might be that on certain processors the caches are big enough to actually hide the overhead of an object table in such a scenario. But, by definition, caches are always too small.

I think we still have the source code of David's version of the RoarVM lying around that does not use an object table. With 'a bit of work' it would be possible to measure that overhead for our interpreter on a single core. So, if you feel like it, I could give you a hand here and there ;)

Best regards
Stefan

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax: +32 2 629 3525
On 12 November 2010 11:59, Stefan Marr <[hidden email]> wrote:
>
> Hi Igor:
>
> On 12 Nov 2010, at 10:32, Igor Stasenko wrote:
>
>> I would like to hear your opinion in this context: if you were designing
>> a VM from scratch and had direct access to a highly optimizing
>> compiler/JIT, what would your choice be?
> I don't think you will get a satisfying answer to that question.
> It might be that on certain processors the caches are big enough to
> actually hide the overhead of an object table in such a scenario.
>
> But, by definition, caches are always too small.
>
> I think we still have the source code of David's version of the RoarVM
> lying around that does not use an object table. With 'a bit of work' it
> would be possible to measure that overhead for our interpreter on a
> single core. So, if you feel like it, I could give you a hand here and
> there ;)

Well, if that's not too much work to run some simple benchmarks :)

> Best regards
> Stefan
>
> --
> Stefan Marr
> Software Languages Lab
> Vrije Universiteit Brussel
> Pleinlaan 2 / B-1050 Brussels / Belgium
> http://soft.vub.ac.be/~smarr
> Phone: +32 2 629 2974
> Fax: +32 2 629 3525

-- Best regards, Igor Stasenko AKA sig.
On Fri, 12 Nov 2010, Igor Stasenko wrote:

> Here is a simulated kind of indirect pointer access: the Wrapper class
> has an 'object' ivar, and in this way we simulate the indirection.
>
> | objects wrapped t1 t2 t3 |
> objects := (1 to: 1000) collect: [:i | Object new ].
> wrapped := objects collect: [:each | Wrapper new object: each ].
>
> t1 := [ 100000 timesRepeat: [ objects do: [:each | each yourself ] ] ] timeToRun.
> t2 := [ 100000 timesRepeat: [ wrapped do: [:each | each object yourself ] ] ] timeToRun.
> t3 := [ 100000 timesRepeat: [ wrapped do: [:each | ] ] ] timeToRun.
> {t1. t2. t3}
>
> Running on Cog it gives:
>
> #(3241 3498 2793)

A single measurement is probably inaccurate. This benchmark creates lots of blocks, which means GC noise. The size of the "object table" is too small to be realistic, which hides cache-related performance hits. Why don't you use an Array instead of Wrapper? IIRC Cog has an optimization for the #at: primitive, and Arrays are compact. Why do you send #yourself and #timesRepeat:? You should also slightly shuffle the objects to be more realistic about cache usage.

Levente

> The first benchmark measures the time to access objects directly, the
> second measures indirect access, and the third measures the loop overhead.
>
> So, taking this naive benchmark, we get:
>
> (3498 - 2793) / (3241 - 2793) asFloat
> 1.573660714285714
>
> so, about 57% slower.
>
> But actually this benchmark shows the cost of an extra message send
> rather than the impact of an extra level of indirection.
> Well, a message send is a kind of indirection... :)
>
> -- Best regards, Igor Stasenko AKA sig.
On 12 November 2010 15:24, Levente Uzonyi <[hidden email]> wrote:
>
> On Fri, 12 Nov 2010, Igor Stasenko wrote:
>
>> Here is a simulated kind of indirect pointer access: the Wrapper class
>> has an 'object' ivar, and in this way we simulate the indirection.
>>
>> | objects wrapped t1 t2 t3 |
>> objects := (1 to: 1000) collect: [:i | Object new ].
>> wrapped := objects collect: [:each | Wrapper new object: each ].
>>
>> t1 := [ 100000 timesRepeat: [ objects do: [:each | each yourself ] ] ] timeToRun.
>> t2 := [ 100000 timesRepeat: [ wrapped do: [:each | each object yourself ] ] ] timeToRun.
>> t3 := [ 100000 timesRepeat: [ wrapped do: [:each | ] ] ] timeToRun.
>> {t1. t2. t3}
>>
>> Running on Cog it gives:
>>
>> #(3241 3498 2793)
>
> A single measurement is probably inaccurate. This benchmark creates lots
> of blocks, which means GC noise. The size of the "object table" is too
> small to be realistic, which hides cache-related performance hits. Why
> don't you use an Array instead of Wrapper? IIRC Cog has an optimization
> for the #at: primitive, and Arrays are compact. Why do you send #yourself
> and #timesRepeat:? You should also slightly shuffle the objects to be
> more realistic about cache usage.

I'm inviting you to make your own version of the benchmark, which could simulate an extra level of indirection for accessing object fields.

Actually, I was trying:

t1 := [ 100000 timesRepeat: [ objects do: [:each | each yourself ] ] ]

vs

t2 := [ 100000 timesRepeat: [ objects do: [:each | each object ] ] ]

where #yourself is there to compensate for the extra message send (#object), so it compares 'read nothing, return self' against 'read ivar', which is an indirection. But what I found is that this gives no difference, and actually t2 < t1 sometimes :)

> Levente
>
>> The first benchmark measures the time to access objects directly, the
>> second measures indirect access, and the third measures the loop overhead.
>>
>> So, taking this naive benchmark, we get:
>>
>> (3498 - 2793) / (3241 - 2793) asFloat
>> 1.573660714285714
>>
>> so, about 57% slower.
>>
>> But actually this benchmark shows the cost of an extra message send
>> rather than the impact of an extra level of indirection.
>> Well, a message send is a kind of indirection... :)

-- Best regards, Igor Stasenko AKA sig.
On 12.11.2010, at 14:41, Igor Stasenko wrote:
> I'm inviting you to make your own version of the benchmark

I don't think this can be realistically simulated inside Squeak. But possibly you could change the macros in sqMemoryAccess.h to fake an object table access?

I just tried that. Using tinyBenchmarks, bytecode performance drops to 63% and sends to 78%.

Now, declaring that variable volatile might be overkill, as it prevents all caching, but I couldn't quite figure out a more realistic declaration.

- Bert -

#else
# ifndef FAKE_OBJ_TABLE
#  define FAKE_OBJ_TABLE
   static volatile int FakeObjTable= 0;
#  define OBJTABLELOOKUP(oop) (oop + FakeObjTable)
# endif
  /* Use macros when static inline functions aren't efficient. */
# define byteAtPointer(ptr)          ((sqInt)(*((unsigned char *)(OBJTABLELOOKUP(ptr)))))
# define byteAtPointerput(ptr, val)  ((sqInt)(*((unsigned char *)(OBJTABLELOOKUP(ptr)))= (unsigned char)(val)))
# define shortAtPointer(ptr)         ((sqInt)(*((short *)(OBJTABLELOOKUP(ptr)))))
# define shortAtPointerput(ptr, val) ((sqInt)(*((short *)(OBJTABLELOOKUP(ptr)))= (short)(val)))
# define intAtPointer(ptr)           ((sqInt)(*((unsigned int *)(OBJTABLELOOKUP(ptr)))))
# define intAtPointerput(ptr, val)   ((sqInt)(*((unsigned int *)(OBJTABLELOOKUP(ptr)))= (int)(val)))
# define longAtPointer(ptr)          ((sqInt)(*((sqInt *)(OBJTABLELOOKUP(ptr)))))
# define longAtPointerput(ptr, val)  ((sqInt)(*((sqInt *)(OBJTABLELOOKUP(ptr)))= (sqInt)(val)))
# define oopAtPointer(ptr)           (sqInt)(*((sqInt *)OBJTABLELOOKUP(ptr)))
# define oopAtPointerput(ptr, val)   (sqInt)(*((sqInt *)OBJTABLELOOKUP(ptr))= (sqInt)val)
# define pointerForOop(oop)          ((char *)(sqMemoryBase + ((usqInt)(oop))))
# define oopForPointer(ptr)          ((sqInt)(((char *)(ptr)) - (sqMemoryBase)))
# define byteAt(oop)                 byteAtPointer(pointerForOop(oop))
# define byteAtput(oop, val)         byteAtPointerput(pointerForOop(oop), (val))
# define shortAt(oop)                shortAtPointer(pointerForOop(oop))
# define shortAtput(oop, val)        shortAtPointerput(pointerForOop(oop), (val))
# define longAt(oop)                 longAtPointer(pointerForOop(oop))
# define longAtput(oop, val)         longAtPointerput(pointerForOop(oop), (val))
# define intAt(oop)                  intAtPointer(pointerForOop(oop))
# define intAtput(oop, val)          intAtPointerput(pointerForOop(oop), (val))
# define oopAt(oop)                  oopAtPointer(pointerForOop(oop))
# define oopAtput(oop, val)          oopAtPointerput(pointerForOop(oop), (val))
#endif
Igor, those of us who design our own hardware have options that are not available when using conventional processors. In the case of object tables, we can use virtually addressed object caches (invented in the Mushroom project - http://www.wolczko.com/mushroom/index.html) to eliminate most of the cost.

In a conventional processor, think about what happens when we execute an instruction like

    load R3, R7, R1

where R1 has the number of the instance variable we want to read (multiplied by the word size, depending on the processor), R7 is the oop for the object, and R3 will store the value of the instance variable. The first step is that R7 and R1 are added and the result is the virtual address of the instance variable. Then the top (20 or so) bits will be searched in the TLB (translation look-aside buffer) of the MMU (memory management unit) and, if found there, they will be replaced with the associated bits, forming the physical address of the instance variable. The last step is that the top bits of the physical address (28 bits in the case of a cache with lines of 16 bytes) are used to find the right line in the data cache, and the bottom bits select the bytes from that line to be loaded into R3. Of course, sometimes the "page" isn't in the TLB, or the data cache doesn't have the needed line, but let's not worry about that for now.

Imagine that we redesign our processor so that the same instruction works like this: we concatenate R7 and R1 into a 64-bit virtual instance variable address and use the top 60 bits to find the right line in the data cache, and the bottom 4 bits to select the bytes from that line to be loaded into R3. We have saved one addition and one MMU lookup at the cost of a larger tag for the cache. An additional cost is that two objects can't share the same cache line like they can in a conventional processor, but that doesn't hurt much.

When we can't find the cache line we need, we have to bring in data from the main memory. That can be done by adding R7 and R1, masking the bottom 4 bits, doing the MMU lookup and fetching the 16 bytes from the result. This would be compatible with the direct-pointer Squeak. But we could instead use R7 as an index into an object table, fetch the base address, add R1 to that, mask the bottom 4 bits, do an MMU lookup (or not - the object table itself could double as a virtual memory system) and fetch the 16 bytes into the new cache line. Since cache misses are rare, the extra memory access here does not impact performance very much.

Note that virtual caches are considered a bad thing in the C world because of aliasing problems: two virtual addresses might map to the same physical address, and then you could have two copies of the same data in the cache and no way to keep them consistent. With object addressing, this is much easier to avoid.

-- Jecel
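[A rough software analogue of the miss path Jecel describes, for readers without a hardware background. All names here are hypothetical illustrations, not from Mushroom or any real VM: the object cache is tagged by the virtual (oop, offset) pair, so hits never touch the object table; only a miss consults the table to find the object's base.]

#include <stdint.h>
#include <string.h>

#define LINE_BYTES  16
#define CACHE_LINES 4096

typedef struct {
    uint64_t tag;                     /* oop and line index, concatenated */
    int      valid;
    unsigned char data[LINE_BYTES];
} Line;

static Line objCache[CACHE_LINES];
extern unsigned char *objectTable[];  /* oop -> base address */

unsigned char *ivarAddress(uint32_t oop, uint32_t offset) {
    /* Virtual tag: no base+offset addition, no TLB walk on a hit. */
    uint64_t tag = ((uint64_t)oop << 32) | (offset / LINE_BYTES);
    Line *line = &objCache[tag % CACHE_LINES];
    if (!line->valid || line->tag != tag) {
        /* Miss: only now do we pay the object-table indirection. */
        unsigned char *base = objectTable[oop];
        memcpy(line->data, base + (offset & ~(LINE_BYTES - 1)), LINE_BYTES);
        line->tag = tag;
        line->valid = 1;
    }
    return &line->data[offset % LINE_BYTES];
}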
On Fri, Nov 12, 2010 at 11:58:59AM +0200, Igor Stasenko wrote:
>
> Here is a simulated kind of indirect pointer access: the Wrapper class
> has an 'object' ivar, and in this way we simulate the indirection.

If you want to measure the effects of an extra level of indirection at a low level, you may want to try hacking the MemoryAccess slang version of the C macros. These implement low-level memory access at the level of pointerForOop and oopForPointer and such. On an interpreter VM they run at the same speed as the actual C macros, to the best of my ability to measure (this is due to the effectiveness of the Slang inliner, which works surprisingly well).

If you use class MemoryAccess as a pattern (either write your own, or modify this one), then you might be able to do experiments like this entirely at the C level, which may give you a better idea of the tradeoffs.

The code is in SqS/VMMaker in package MemoryAccess. A few changes are needed in the C headers in the platforms source, which can be found here:

http://wiki.squeak.org/squeak/uploads/6081/MemoryAccessPlatformDiffs.zip

along with some explanations:

http://wiki.squeak.org/squeak/6081

HTH,
Dave
On Fri, Nov 12, 2010 at 02:03:44PM -0500, David T. Lewis wrote:
>
> If you want to measure the effects of an extra level of indirection at a
> low level, you may want to try hacking the MemoryAccess slang version of
> the C macros.

Oops, sorry, I didn't notice that Bert had already done this experiment and posted the results:

On Fri, Nov 12, 2010 at 04:44:42PM +0100, Bert Freudenberg wrote:
>
> I don't think this can be realistically simulated inside Squeak. But
> possibly you could change the macros in sqMemoryAccess.h to fake an
> object table access?
>
> I just tried that. Using tinyBenchmarks, bytecode performance drops
> to 63% and sends to 78%.
>
> Now, declaring that variable volatile might be overkill, as it prevents
> all caching, but I couldn't quite figure out a more realistic declaration.
>
> - Bert -
On 12 November 2010 17:44, Bert Freudenberg <[hidden email]> wrote:
>
> On 12.11.2010, at 14:41, Igor Stasenko wrote:
>
>> I'm inviting you to make your own version of the benchmark
>
> I don't think this can be realistically simulated inside Squeak. But
> possibly you could change the macros in sqMemoryAccess.h to fake an
> object table access?
>
> I just tried that. Using tinyBenchmarks, bytecode performance drops to
> 63% and sends to 78%.
>
> Now, declaring that variable volatile might be overkill, as it prevents
> all caching, but I couldn't quite figure out a more realistic declaration.

You mean that a non-volatile declaration like:

int FakeObjTable = 0;

could be optimized away by the compiler? Well, since the compiler compiles module by module (separate C files), if you remove 'static' it is no longer able to optimize it into a no-op, since it can't guess what may happen to this variable in another object file: even if one module has only read-only access to it, some other module could contain code which modifies it.

So I think this is the worst-case performance slowdown. :)

If we take into account that to get an object's location you need to do the object table lookup only once, and then any subsequent read/write operations on the object won't require a table lookup, this can be improved. Consider, for example, that to read an ivar the interpreter reads & checks the header and only then the ivar slot, so it should cost one table lookup and two reads at the object's location, instead of two table lookups + two reads. (A sketch of that pattern follows below.)

> - Bert -
>
> #else
> # ifndef FAKE_OBJ_TABLE
> #  define FAKE_OBJ_TABLE
>    static volatile int FakeObjTable= 0;
> #  define OBJTABLELOOKUP(oop) (oop + FakeObjTable)
> # endif
> [snip - the rest of the macros, as in Bert's message above]
> #endif

-- Best regards, Igor Stasenko AKA sig.
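[A C sketch of the access pattern Igor describes above — names and header layout are hypothetical, not from the Squeak sources: the oop is translated once, and both the header check and the ivar read then go through the cached base, so the cost is one table lookup plus two reads rather than two of each.]

#include <stdint.h>

extern uintptr_t *objectTable[];   /* hypothetical oop -> address table */

uintptr_t readIvar(uintptr_t oop, unsigned ivarIndex) {
    uintptr_t *base   = objectTable[oop];   /* one table lookup */
    uintptr_t  header = base[0];            /* read & check the header */
    (void)header;                           /* ...format/size checks here... */
    return base[1 + ivarIndex];             /* second read, no second lookup */
}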
On 13.11.2010, at 03:17, Igor Stasenko wrote:
>
> On 12 November 2010 17:44, Bert Freudenberg <[hidden email]> wrote:
>>
>> On 12.11.2010, at 14:41, Igor Stasenko wrote:
>>
>>> I'm inviting you to make your own version of the benchmark
>>
>> I don't think this can be realistically simulated inside Squeak. But
>> possibly you could change the macros in sqMemoryAccess.h to fake an
>> object table access?
>>
>> I just tried that. Using tinyBenchmarks, bytecode performance drops to
>> 63% and sends to 78%.
>>
>> Now, declaring that variable volatile might be overkill, as it prevents
>> all caching, but I couldn't quite figure out a more realistic declaration.
>
> You mean that a non-volatile declaration like:
>
> int FakeObjTable = 0;
>
> could be optimized away by the compiler?
> Well, since the compiler compiles module by module (separate C files),
> if you remove 'static' it is no longer able to optimize it into a no-op,
> since it can't guess what may happen to this variable in another object
> file: even if one module has only read-only access to it, some other
> module could contain code which modifies it.

Yes, but that wouldn't build, because the linker would complain about FakeObjTable being defined globally more than once. The better way would be to declare it extern in the header. But then you need to define it for real in a single C file; interp.c would be a good one. You should try that :) (A sketch of that pattern follows below.)

> So I think this is the worst-case performance slowdown. :)

I think so, too. It is significant.

> If we take into account that to get an object's location you need to do
> the object table lookup only once, and then any subsequent read/write
> operations on the object won't require a table lookup, this can be improved.
> Consider, for example, that to read an ivar the interpreter reads & checks
> the header and only then the ivar slot, so it should cost one table lookup
> and two reads at the object's location, instead of two table lookups +
> two reads.

But that's no different from what we have now. We only access memory if necessary. There would not be fewer lookups if you have an object table.

- Bert -

>> [snip - the macro definitions from my earlier message]
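[A sketch of the declaration pattern Bert suggests — illustrative, not the actual Squeak sources: the header only declares the variable, exactly one C file defines it, so every module pays the load but the linker sees a single definition.]

/* In sqMemoryAccess.h -- declaration only: */
extern int FakeObjTable;
#define OBJTABLELOOKUP(oop) ((oop) + FakeObjTable)

/* In exactly one C file, e.g. interp.c -- the single definition: */
int FakeObjTable = 0;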
On Sat, 13 Nov 2010, Igor Stasenko wrote:

> On 12 November 2010 17:44, Bert Freudenberg <[hidden email]> wrote:
>>
>> On 12.11.2010, at 14:41, Igor Stasenko wrote:
>>
>>> I'm inviting you to make your own version of the benchmark
>>
>> I don't think this can be realistically simulated inside Squeak. But
>> possibly you could change the macros in sqMemoryAccess.h to fake an
>> object table access?
>>
>> I just tried that. Using tinyBenchmarks, bytecode performance drops to
>> 63% and sends to 78%.
>>
>> Now, declaring that variable volatile might be overkill, as it prevents
>> all caching, but I couldn't quite figure out a more realistic declaration.
>
> You mean that a non-volatile declaration like:
>
> int FakeObjTable = 0;
>
> could be optimized away by the compiler?
> Well, since the compiler compiles module by module (separate C files),
> if you remove 'static' it is no longer able to optimize it into a no-op,
> since it can't guess what may happen to this variable in another object
> file: even if one module has only read-only access to it, some other
> module could contain code which modifies it.
>
> So I think this is the worst-case performance slowdown. :)
>
> If we take into account that to get an object's location you need to do
> the object table lookup only once, and then any subsequent read/write
> operations on the object won't require a table lookup, this can be improved.

You can't do that if you want O(1) time for #become:.

Levente

> Consider, for example, that to read an ivar the interpreter reads & checks
> the header and only then the ivar slot, so it should cost one table lookup
> and two reads at the object's location, instead of two table lookups +
> two reads.
>
>> - Bert -
>>
>> [snip - the macro definitions from Bert's message]
2010/11/13 Levente Uzonyi <[hidden email]>:
>
> On Sat, 13 Nov 2010, Igor Stasenko wrote:
>
>> [snip - the FakeObjTable discussion]
>>
>> If we take into account that to get an object's location you need to do
>> the object table lookup only once, and then any subsequent read/write
>> operations on the object won't require a table lookup, this can be improved.
>
> You can't do that if you want O(1) time for #become:.

For instance, let's take a bytecode read. Should each bytecode read also go through the object table? (A sketch of the problem follows below.)

> Levente
>
>> Consider, for example, that to read an ivar the interpreter reads & checks
>> the header and only then the ivar slot, so it should cost one table lookup
>> and two reads at the object's location, instead of two table lookups +
>> two reads.

-- Best regards, Igor Stasenko AKA sig.
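[A C sketch of the tension in this exchange — all names hypothetical, not from any real VM: a raw base pointer cached across any point where #become: or a compacting GC may run can go stale and must be re-fetched from the table, which is exactly why hot paths such as the bytecode fetch are painful to cache.]

#include <stdint.h>

extern uintptr_t *objectTable[];
extern void processSlot(uintptr_t value);
extern int  objectsMayHaveMoved(void);   /* hypothetical: e.g. after a send,
                                            an allocation, or a #become: */

void scanObject(uintptr_t oop, unsigned slots) {
    uintptr_t *base = objectTable[oop];          /* translate once... */
    for (unsigned i = 0; i < slots; i++) {
        if (objectsMayHaveMoved())
            base = objectTable[oop];             /* ...but re-translate after
                                                    any #become:/GC point */
        processSlot(base[i]);
    }
}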
LOL! Back in 1981 or '82 I built the first Smalltalk system WITHOUT an object table. It used 32-bit direct pointers and Generation Scavenging (which I "invented"). First Smalltalk VM with direct pointers, first with generational GC, first with 32-bit OOPs. It was called "Berkeley Smalltalk", or BS.

Peter (Deutsch) bet me a dinner about how much eliminating the OT would speed things up, and when I surpassed Peter's estimate (I think it was 1.4x), I collected one of the best dinners I have ever had. Soon after, PS (Deutsch & Schiffman) ran rings around BS, but that's another story.

- David

On Nov 12, 2010, at 2:56 AM, Igor Stasenko wrote:
>
> On 12 November 2010 11:59, Stefan Marr <[hidden email]> wrote:
>>
>> Hi Igor:
>>
>> On 12 Nov 2010, at 10:32, Igor Stasenko wrote:
>>
>>> I would like to hear your opinion in this context: if you were designing
>>> a VM from scratch and had direct access to a highly optimizing
>>> compiler/JIT, what would your choice be?
>> I don't think you will get a satisfying answer to that question.
>> It might be that on certain processors the caches are big enough to
>> actually hide the overhead of an object table in such a scenario.
>>
>> But, by definition, caches are always too small.
>>
>> I think we still have the source code of David's version of the RoarVM
>> lying around that does not use an object table. With 'a bit of work' it
>> would be possible to measure that overhead for our interpreter on a
>> single core. So, if you feel like it, I could give you a hand here and
>> there ;)
>
> Well, if that's not too much work to run some simple benchmarks :)
>
>> Best regards
>> Stefan
>> [snip - signature]
>
> -- Best regards, Igor Stasenko AKA sig.
Hi there,
I've just arrived at this thread (thanks to Mariano), and I wanted to share some speculations. Assume JIT'ed code with self (the oop of the actual object) in one register and selfID (the index of self in the object table) in a second register. Then we have:
- accessing an ivar: no extra cost
- method lookup: one extra indirection
- sends with a MonomorphicInlineCache: no extra cost if implemented on a per-instance basis (checking against selfID); one indirection otherwise
- GC (MarkAndCompact): faster (due to the removal of the threading pass)

saludos
jb
On 26 October 2011 23:11, Javier Burroni <[hidden email]> wrote:
>
> Igor Stasenko wrote:
>>
>> The first benchmark measures the time to access objects directly, the
>> second measures indirect access, and the third measures the loop overhead.
>
> Hi there,
> I've just arrived at this thread (thanks to Mariano), and I wanted to
> share some speculations. Assume JIT'ed code with self (the oop of the
> actual object) in one register and selfID (the index of self in the
> object table) in a second register.

Yes, but then I will ask you to compare the results against a JIT optimized for direct pointers.. :)

> Then we have:
> - accessing an ivar: no extra cost
> - method lookup: one extra indirection
> - sends with a MonomorphicInlineCache: no extra cost if implemented on a
>   per-instance basis (checking against selfID); one indirection otherwise

Hmm.. that doesn't make the inline cache effective. Usually many different objects pass through a single call site, but they have the same class; this is where a monomorphic IC shines. If you change the cache to work on a per-instance basis, I think it will be less effective because of too many misses.

> - GC (MarkAndCompact): faster (due to the removal of the threading pass)

Yes, GC is faster: you don't need to rewrite pointers in each object. With an object table, when you move objects you only need to change the pointer in the object table and you are done.

-- Best regards, Igor Stasenko.
> Yes, but then I will ask you to compare the results against a JIT
> optimized for direct pointers.. :)
>
>> Then we have:
>> - accessing an ivar: no extra cost
>> - method lookup: one extra indirection
>> - sends with a MonomorphicInlineCache: no extra cost if implemented on a
>>   per-instance basis (checking against selfID); one indirection otherwise
>
> Hmm.. that doesn't make the inline cache effective.
> Usually many different objects pass through a single call site, but they
> have the same class; this is where a monomorphic IC shines.
> If you change the cache to work on a per-instance basis, I think it will
> be less effective because of too many misses.

In the JIT'ed prologue you may have something like:

cmpClass:
        mov  [objectTable + selfID], self  ; load self through the object table
        cmp  [self - 4], nativizedClass    ; per-class check: one indirection
        jz   endOfPrologue
        ; patching code must be added here
        jmp  lookupAndJIT

entryPoint:                                ; call sites jump here
        cmp  selfID, nativizedSelfID       ; per-instance check: no table access
        jnz  cmpClass
        mov  nativizedSelf, self           ; hit: reuse the cached self
endOfPrologue:

So you add (mainly) one extra memory access, if the branch predictor helps.

--
" To be is to do " ( Socrates )
" To be or not to be " ( Shakespeare )
" To do is to be " ( Sartre )
" Do be do be do " ( Sinatra )
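[The same check rendered in C for readability — all names are hypothetical, not from any actual JIT: a per-instance hit compares the receiver's table ID and needs no table access; the per-class fallback costs the one indirection Javier mentions; a full miss re-enters the lookup and would patch the site.]

#include <stdint.h>

typedef struct { uintptr_t classOop; /* ... object body ... */ } Object;

extern Object *objectTable[];                        /* selfID -> body */
extern void  (*lookupAndJIT(uintptr_t selfID,
                            uintptr_t selector))(void);

void send(uintptr_t selfID, uintptr_t selector,
          uintptr_t cachedSelfID, uintptr_t cachedClass,
          void (*cachedTarget)(void)) {
    if (selfID == cachedSelfID) {                    /* per-instance hit:  */
        cachedTarget();                              /* no table access    */
        return;
    }
    if (objectTable[selfID]->classOop == cachedClass) { /* per-class hit:  */
        cachedTarget();                                 /* one indirection */
        return;
    }
    lookupAndJIT(selfID, selector)();                /* miss: full lookup,
                                                        then patch the site */
}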