Tagging

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Tagging

Jason Johnson-5
Hello all,

As I have mentioned before, I am thinking about creating a new VM for testing some of my ideas.  One thing I noticed in the blue book was the tagging they use on the first (least significant) bit.  By doing this you always have to shift the number before you can work with it.

My thought was moving the tag to the last bit (most significant) and making 0 mean SmallPositiveInteger and 1 for OOP's.  This way, once the number is identified to be a number nothing more need be done to use it and it doesn't need to be converted to be put back.

As far as negative numbers, if I use a 31 bit header for my OOPS then the first bit of the header can be a flag.  If the flag is 1 it's actually a negative number, otherwise it's a normal header.

To clarify what I mean, here are some examples (first half byte in binary, rest in hex)

0000-0 00 00 ff - SmallPositiveInteger(255)
0111-f ff ff ff - SmallPositiveInteger(2147483647)

1xxx-x xx xx 01 - OOP(random memory address)
1xxx-x xx xx 02 - OOP(random memory address)
1xxx-x xx xx 03 - OOP(random memory address)

(at address 0xxx-x xx xx 01)
1000-0 00 00 00 - SmallNegativeInteger(-2147483647)

(at address 0xxx-x xx xx 02)
1111-f ff ff ff - SmallNegativeInteger(-1)

(at address 0xxx-x xx xx 03)
0xxx-x xx xx xx - OO Header for a regular object

So does anyone know any problems with this approach that make it unusable/unpractical?  The only things I see are:

1) This representation is tied to 2's compliment, but are there any system that don't use that?
2) This system can only deal with 2 gigs worth of address pointers (can be hard coded to high or low but not both).  How does any Smalltalk get around this?
3) Negative numbers require indirect memory access.  Are negative numbers used so much this would be a problem?  If so, one option would be to switch from 31 bits to 30 and use 2 bits for the tag (00 = SmallPositiveInteger, 11 = SmallNegativeInteger, 01 = Float?)  I think VW is using two tag bits.
4) By tagging on the high side I have no option of using a value less then 32 bits (e.g. an 8-byte char).  Perhaps characters can be done as my initial negative number solution as well by having a special OO header to represent them.  Since the pointer points to the top, not the bottom maybe the header could be variable size ( i.e. less then 32 bits).

So what do you all think?  Any big problems I didn't mention?  Or solutions to the problems I mentioned?  Or papers about this kind of thing (since my google-fu has not found too much on the subject so far)?

Thanks in advance,
Jason
_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Reply | Threaded
Open this post in threaded view
|

Re: Tagging

Andrew Gaylard
On 8/4/07, Jason Johnson <[hidden email]> wrote:

So what do you all think?  Any big problems I didn't mention?  Or solutions to the problems I mentioned?  Or papers about this kind of thing (since my google-fu has not found too much on the subject so far)?

There's a trade-off here:  either small-integer arithmetic
is made a little slower, or access to all objects is made
a little slower.  Nothing's free (unless an entirely different
scheme is proposed).

The reason for the current system is that it was built to
run on the Motorola M68000 series of CPUs which were
used in Apple's machines of the time.  These CPUs, like
most today, required aligned access, meaning that to
read or write a 4-byte integer value in memory, it has to
be stored at an address that is a multiple of 4.

Actually, I seem to remember that the M68k required an
alignment of only 2 bytes, but I could be wrong.  Anyway.
Suffice to say that to access an n-byte sized integer or
floating value on modern hardware, it needs to be stored
at a modulo-n address.  The point is that all modern chips
require alignment, with the exception of the x86 series,
which copes with mis-aligned data but is considerably
slower than if it were aligned (something like 10 times,
ISTR).  Check out
http://en.wikipedia.org/wiki/Data_structure_alignment

So for a 32-bit machine, we want to store pointer values
(OOPs) or numeric values in registers and memory.
Now, the hardware makes it a requirement for the lowest
2 bits to be zero for pointers to aligned (32-bit) data.
So already, pointers are "tagged" by the hardware: any
value with the bottom bit set is by definition not a valid
pointer.  The authors of squeak spotted this property
and came up with the current scheme.

There's another point to keep in mind with your scheme:
what about address space?  On my SPARC box,
pointers can be smallish values:

apg@breakfast: ~ cat t.c    
#include <stdio.h>
#include <stdlib.h>
int main( int argc, char * argv[] )
{
        char * p = (char*) malloc( 23 );
        printf( "Address is %p\n", p );
}

apg@breakfast: ~ gcc t.c
apg@breakfast: ~ ./a.out
Address is 20d60

However, I'm pretty sure that this number could just as
easily be enormous, but your scheme would limit it to
2G of address space.  It would also require setting the
top-most bit on OOP-creation, testing it on access,
and converting to a memory address.  A test and shift
for integers sounds simpler and cheaper (to me anyway).

The current scheme can provide 31-bit integers as well
as a 4GB address space for other objects, and at a
pretty low cost.  And 4GB is already a bit of a limitation --
I was involved in a project which needed squeak to
work with 6+GB of objects.  Anyone know how well
the 64-bit Squeak works these days?

Andrew



_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Reply | Threaded
Open this post in threaded view
|

Tagging

Bryce Kampjes
In reply to this post by Jason Johnson-5
Jason Johnson writes:
 > Hello all,
 >
 > As I have mentioned before, I am thinking about creating a new VM for
 > testing some of my ideas.  One thing I noticed in the blue book was the
 > tagging they use on the first (least significant) bit.  By doing this you
 > always have to shift the number before you can work with it.

You don't need to shift the numbers for addition and subtraction, just
make sure the bottom bits are 0. Even with the tag being 1, you
only need to detag one argument and if that's a constant the
detagging can be done at compile time.

With multiplication and division only one argument needs to be
shifted.

 > My thought was moving the tag to the last bit (most significant) and making
 > 0 mean SmallPositiveInteger and 1 for OOP's.  This way, once the number is
 > identified to be a number nothing more need be done to use it and it doesn't
 > need to be converted to be put back.
 >
 > As far as negative numbers, if I use a 31 bit header for my OOPS then the
 > first bit of the header can be a flag.  If the flag is 1 it's actually a
 > negative number, otherwise it's a normal header.
 >
 > To clarify what I mean, here are some examples (first half byte in binary,
 > rest in hex)
 >
 > 0000-0 00 00 ff - SmallPositiveInteger(255)
 > 0111-f ff ff ff - SmallPositiveInteger(2147483647)
 >
 > 1xxx-x xx xx 01 - OOP(random memory address)
 > 1xxx-x xx xx 02 - OOP(random memory address)
 > 1xxx-x xx xx 03 - OOP(random memory address)
 >
 > (at address 0xxx-x xx xx 01)
 > 1000-0 00 00 00 - SmallNegativeInteger(-2147483647)
 >
 > (at address 0xxx-x xx xx 02)
 > 1111-f ff ff ff - SmallNegativeInteger(-1)
 >
 > (at address 0xxx-x xx xx 03)
 > 0xxx-x xx xx xx - OO Header for a regular object
 >
 > So does anyone know any problems with this approach that make it
 > unusable/unpractical?  The only things I see are:

It was used by some Lisp systems in the mid 80's. I think CMU Common
Lisp from memory.

There are two disadvantages of using the top bit:
1) You need to control the address space to make sure all objects
live. Controlling the address space to that degree is unlikely to
be something that can be done portably.
2) You loose half your address space.

 > 1) This representation is tied to 2's compliment, but are there any system
 > that don't use that?
 > 2) This system can only deal with 2 gigs worth of address pointers (can be
 > hard coded to high or low but not both).  How does any Smalltalk get around
 > this?

Alignment. Words are aligned to 4 byte boundaries which is always
required for performance and required by some CPU architectures.

 > 3) Negative numbers require indirect memory access.  Are negative numbers
 > used so much this would be a problem?  If so, one option would be to switch
 > from 31 bits to 30 and use 2 bits for the tag (00 = SmallPositiveInteger, 11
 > = SmallNegativeInteger, 01 = Float?)  I think VW is using two tag bits.
 > 4) By tagging on the high side I have no option of using a value less then
 > 32 bits (e.g. an 8-byte char).  Perhaps characters can be done as my initial
 > negative number solution as well by having a special OO header to represent
 > them.  Since the pointer points to the top, not the bottom maybe the header
 > could be variable size (i.e. less then 32 bits).
 >
 > So what do you all think?  Any big problems I didn't mention?  Or solutions
 > to the problems I mentioned?  Or papers about this kind of thing (since my
 > google-fu has not found too much on the subject so far)?

There is a large literature on garbage collection. These day's I'd try
citeseer as well as Google. Scanning PLDI and OOPSLA proceedings from
the late 70s and 80s will provide plenty of information if you've
either got ACM membership or access to a university library.

Bryce
_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Reply | Threaded
Open this post in threaded view
|

Re: Tagging

Jason Johnson-5
Thanks Bryce and Andrew for your quick and informative responses.  Comments/questions below.

On 8/4/07, [hidden email] <[hidden email]> wrote:

You don't need to shift the numbers for addition and subtraction, just
make sure the bottom bits are 0. Even with the tag being 1, you
only need to detag one argument and if that's a constant the
detagging can be done at compile time.

With multiplication and division only one argument needs to be
shifted.


Interesting.  Well, I assume that in Smalltalk object access is much more often then integer access, but does anyone have any data that proves this?  If it is the case then I guess that's the death of my MSB tagging idea. :)

Doesn't Strongtalk use 0 tagged SmallIntegers?  And if so, why?  I looked in the source code a bit but I haven't figured out yet where that would be.


It was used by some Lisp systems in the mid 80's. I think CMU Common
Lisp from memory.

There are two disadvantages of using the top bit:
1) You need to control the address space to make sure all objects
live. Controlling the address space to that degree is unlikely to
be something that can be done portably.
2) You loose half your address space.


Well, I suppose I could shift the pointers up one (which would make the first bit zero) to reclaim the space, but as I believe it must be the case that object access is much more common then number access, that would be a bad trade-off. 
 

Alignment. Words are aligned to 4 byte boundaries which is always
required for performance and required by some CPU architectures.


I see, I didn't realize the bottom 2 bits had to be zero.  That answers a lot of my questions. :)

There is a large literature on garbage collection. These day's I'd try
citeseer as well as Google. Scanning PLDI and OOPSLA proceedings from
the late 70s and 80s will provide plenty of information if you've
either got ACM membership or access to a university library.

Bryce


I have the "Garbage Collection" book from Jones & Lins, but I will try to look on those sites as well.  I just need to go to the "ACM digital library", no?


_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Reply | Threaded
Open this post in threaded view
|

Re: Tagging

Bryce Kampjes

I'd doubt that using the top bits has any significant advantages
over using the bottom bits with a zero integer tag. Using the top
bits adds the problem of getting the OS to allocate memory to fit
around that policy.

Jason Johnson writes:
 > Thanks Bryce and Andrew for your quick and informative responses.
 > Comments/questions below.
 >
 > On 8/4/07, [hidden email] <[hidden email]> wrote:
 > >
 > >
 > > You don't need to shift the numbers for addition and subtraction, just
 > > make sure the bottom bits are 0. Even with the tag being 1, you
 > > only need to detag one argument and if that's a constant the
 > > detagging can be done at compile time.
 > >
 > > With multiplication and division only one argument needs to be
 > > shifted.
 >
 >
 >
 > Interesting.  Well, I assume that in Smalltalk object access is much more
 > often then integer access, but does anyone have any data that proves this?
 > If it is the case then I guess that's the death of my MSB tagging idea. :)
 >
 > Doesn't Strongtalk use 0 tagged SmallIntegers?  And if so, why?  I looked in
 > the source code a bit but I haven't figured out yet where that would be.

Yes.

The trick is detagging pointers can be  done using offet addressing so
that's  free   as  well  as  detagging     integers for  addition  and
subtraction. So by using a zero tag there's no overhead for common
operations. There's a single shift for multiplication and division but
both these operations can be slow.



 >
 > > It was used by some Lisp systems in the mid 80's. I think CMU Common
 > > Lisp from memory.
 > >
 > > There are two disadvantages of using the top bit:
 > > 1) You need to control the address space to make sure all objects
 > > live. Controlling the address space to that degree is unlikely to
 > > be something that can be done portably.
 > > 2) You loose half your address space.
 >
 >
 >
 > Well, I suppose I could shift the pointers up one (which would make the
 > first bit zero) to reclaim the space, but as I believe it must be the case
 > that object access is much more common then number access, that would be a
 > bad trade-off.
 >
 >
 > Alignment. Words are aligned to 4 byte boundaries which is always
 > > required for performance and required by some CPU architectures.
 >
 >
 >
 > I see, I didn't realize the bottom 2 bits had to be zero.  That answers a
 > lot of my questions. :)
 >
 > There is a large literature on garbage collection. These day's I'd try
 > > citeseer as well as Google. Scanning PLDI and OOPSLA proceedings from
 > > the late 70s and 80s will provide plenty of information if you've
 > > either got ACM membership or access to a university library.
 > >
 > > Bryce
 >
 >
 >
 > I have the "Garbage Collection" book from Jones & Lins, but I will try to
 > look on those sites as well.  I just need to go to the "ACM digital
 > library", no?

ACM digital library should work. I'm not a member so can't say
for sure. Most papers are availible for free for recent stuff and
the key classics are slowly appearing too.

Bryce
_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery