A Smalltalk object database idea


Louis LaBrunda

A Smalltalk object database idea

Hello Squeak VM Guys,

My name is Louis LaBrunda. I use Instantiations VA Smalltalk but dabble
with Squeak from time to time.

I have an outside-the-box way of implementing an object database for
Smalltalk, and I would like to see whether anyone here is interested in
implementing it. I understand the theory behind Smalltalk VMs (at least I
think I do) but would face a steep learning curve to actually modify one.
This idea doesn't require inventing or improving any technology, but it
does require changes to the VM.

For the purpose of describing this idea, I will deal with only one database
and not go into binding to the database and other details like transaction
processing and such. These things are of course important but I think they
can be handled in very much standard ways that should not be changed by
this means of implementing the object database.

The idea is that the VM would treat the database file much like a CPU chip
treats RAM, and would treat its own (the VM's) memory like a CPU chip
treats its internal (on-chip) cache. The data in memory would be linked to
the data in the database in much the same way that a CPU chip's cache is
linked to RAM.

As I said, I'm not very knowledgeable about the internal workings of Smalltalk
VMs, so much of what I am about to say is guess work but I think it is
accurate. Objects represented in the memory of a Smalltalk VM probably
take up about 12 bytes or so for 32 bit systems, more for 64 bit systems.
Much of these bytes are bits that define the class. Some of the bytes
might be the value of the object if it is say a small integer or a byte or
character. If the data (value) of the object is larger than will fit in a
few bytes, there is a pointer to the data. If the object has instance
variables that are of course other objects, there are pointers to them.

A bit would be needed to indicate a persisted object and probably another
bit to indicate the object is dirty (changed and therefore doesn't match
the database file copy). Objects with the persisted bit off would
otherwise look and be treated the same as they are now. Objects with the
persisted bit on would have all their pointers replaced with offsets from
the beginning of the database file (a single file containing all the
persisted objects). All objects pointed to by a persisted object must also
be persisted objects.
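
A minimal sketch of the two flag bits described above, written in Python purely for illustration. The bit positions and function names are assumptions made up for this example; they are not the actual Squeak object header layout.

```python
# Hypothetical header flags: positions are assumed for illustration only
# and do not reflect the real Squeak object header format.
PERSISTED_BIT = 1 << 0  # object belongs to the database file
DIRTY_BIT = 1 << 1      # in-memory copy no longer matches the file copy


def is_persisted(header: int) -> bool:
    return bool(header & PERSISTED_BIT)


def is_dirty(header: int) -> bool:
    return bool(header & DIRTY_BIT)


def mark_written(header: int) -> int:
    """Called after a store into the object: only persisted objects become dirty."""
    if is_persisted(header):
        return header | DIRTY_BIT
    return header


def mark_flushed(header: int) -> int:
    """Called after the object has been written back to the database file."""
    return header & ~DIRTY_BIT
```

Non-persisted objects pay only for the single bit test in mark_written, which matches the intent that they are otherwise treated as they are now.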

When the VM comes across a persisted object it would use the pointers (that
are now offsets within the database file) as keys into a lookup table (hash
table) to find the real pointer to the data in memory. If the item is
found in the lookup table the value is used as it would have been if it was
in the object and all is the same. If the item is not found in the lookup
table the offset into the database file is used to read the object from the
database. The lookup table would then be updated to include the new item.
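
The lookup step above could be sketched roughly as follows, again in Python for illustration. The class name ObjectCache and the on-disk record format (a 4-byte slot count followed by 8-byte slots) are assumptions invented for this example, not something any existing VM uses.

```python
import io
import struct


class ObjectCache:
    """Maps database file offsets to in-memory objects, faulting on a miss."""

    def __init__(self, db_file):
        self.db_file = db_file
        self.table = {}   # file offset -> in-memory object (here: a list of slots)
        self.faults = 0   # number of times an object had to be read from the file

    def resolve(self, offset):
        obj = self.table.get(offset)
        if obj is None:
            # Miss: read the object from the database file and remember it.
            obj = self._read_object(offset)
            self.table[offset] = obj
            self.faults += 1
        return obj

    def _read_object(self, offset):
        # Assumed record format: little-endian 4-byte slot count,
        # followed by that many 8-byte signed slots.
        self.db_file.seek(offset)
        (count,) = struct.unpack("<I", self.db_file.read(4))
        return list(struct.unpack(f"<{count}q", self.db_file.read(8 * count)))
```

For example, with a two-record file held in an io.BytesIO, calling resolve(0) twice reads the file only once and returns the same in-memory object both times.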

As far as I can tell the copies of the object in memory and in the database
file can be identical (no object dumper/loader serialization). There may
need to be a little bit of a wrapper in the database file but I don't think
much. This should make for a very quick loading and saving of objects.

Probably some objects, like blocks of code, can't or shouldn't be saved to
the database (I'm not sure if this is true for Squeak). But I don't think
that is any different than systems that use object dumper/loader
serialization.

I think a low priority fork could run through the lookup table for objects
with the dirty bit set and save them to the database file. A #persist (or
some other good name) method could be added to #Object to force the saving
of an object to the database. This would probably be implemented with a
primitive but maybe not.
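
One pass of such a background writer might look like the sketch below (Python, for illustration). CachedObject, write_object and flush_dirty are hypothetical names, the record format is the same assumed one (4-byte count plus 8-byte slots), and the sketch assumes an object is rewritten in place without changing size.

```python
import io
import struct
from dataclasses import dataclass


@dataclass
class CachedObject:
    slots: list
    dirty: bool = False


def write_object(db_file, offset, obj):
    # Assumed record format: 4-byte slot count, then 8-byte signed slots.
    db_file.seek(offset)
    db_file.write(struct.pack(f"<I{len(obj.slots)}q", len(obj.slots), *obj.slots))


def flush_dirty(table, db_file):
    """One pass over the lookup table; returns how many objects were written."""
    written = 0
    for offset, obj in table.items():
        if obj.dirty:
            write_object(db_file, offset, obj)
            obj.dirty = False  # memory and file copies match again
            written += 1
    return written
```

A forced save of a single object would be the same write_object call made directly, without waiting for the next pass.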

There may be some changes needed for garbage collection to keep the lookup
table up to date but I don't think that will be a big deal. Hopefully
garbage collection for the database file could be handled mostly by
Smalltalk code with the help of a few primitives.

Well, that's it for now. I hope this has been an interesting read and not
a waste of your time. If you think the idea has merit, let me know and we
can discuss it further.

Thank you very much for your time.

Lou
-----------------------------------------------------------
Louis LaBrunda
Keystone Software Corp.
SkypeMe callto://PhotonDemon
mailto:[hidden email] http://www.Keystone-Software.com

David T. Lewis

Re: A Smalltalk object database idea

Hello and welcome.

I think you'll find the Squeak VM to be quite adaptable to experiments
like this. All the code for the object memory is written in Smalltalk
(actually a limited subset of Smalltalk), so it is quite accessible
and relatively easy to modify.

If you have not already done so, try loading the VMMaker package
from SqueakSource, and read the class comment of ObjectMemory for
a description of the object memory organization and header formats
(I'm not sure how familiar you are with Squeak at this point, so
ask some questions if this is not clear). Also, read the "Back
to the Future" paper for general background:
http://ftp.squeak.org/docs/OOPSLA.Squeak.html

Dave

On Tue, Dec 29, 2009 at 10:10:11AM -0500, Louis LaBrunda wrote:

> [...]

Stephen Pair

Re: A Smalltalk object database idea

In reply to this post by Louis LaBrunda

OOZE and LOOM by Ted Kaehler, et al did this kind of thing. Here's a link to the 1981 article on OOZE:

http://www-cs-students.stanford.edu/~eswierk/misc/kaehler81/

It mentions LOOM, but doesn't go into detail...I think the more detailed LOOM paper(s) are in the ACM digital library.

- Stephen


Jecel Assumpcao Jr

Re: A Smalltalk object database idea

Stephen Pair wrote:
> OOZE and LOOM by Ted Kaehler, et al did this kind of thing. Here's a link to the 1981 article on OOZE: http://www-cs-students.stanford.edu/~eswierk/misc/kaehler81/
> It mentions LOOM, but doesn't go into detail...I think the more detailed LOOM paper(s) are in the ACM digital library.

There are some papers, but chapter 14 of the "green book" is probably
the best place to learn about LOOM. The book is available at

http://stephane.ducasse.free.fr/FreeBooks/BitsOfHistory/

Though LOOM (and OOZE) are actually virtual memory systems rather than
databases, if you don't define what you mean by "database" then they are
probably good enough for most uses. Going more in the direction of
industry standard databases, Gemstone is a great example of what can be
done in Smalltalk. This paper (which I can't read right now) probably
has some information about it:

http://portal.acm.org/citation.cfm?id=125223.125254

-- Jecel

Louis LaBrunda

Re: A Smalltalk object database idea

In reply to this post by Stephen Pair

Hi Stephen,

Thanks for the reference. OOZE and maybe LOOM (I couldn't see much about
LOOM) seem to be virtual memory for objects. A way to expand the size of
memory. I'm talking about an object database built with virtual memory
ideas. I know databases are ways to expand the size of memory but I'm
looking at their persistence feature and not making memory look bigger.

In my scheme, the lookup table is used to find persisted (database only)
objects in memory. Non database objects are NOT in the lookup table. Other
than the time it takes to test if an object is persisted (a bit that
indicates it is in the database) processing of non database objects is
normal.

Database objects need a little more work. If they are in the lookup table,
they are easily found in memory. If not in the lookup table, they can be
read from the database and the lookup table updated.

Lou


Stephen Pair

Re: Re: A Smalltalk object database idea

On Tue, Dec 29, 2009 at 11:56 AM, Louis LaBrunda <[hidden email]> wrote:

> Hi Stephen,
>
> Thanks for the reference. OOZE and maybe LOOM (I couldn't see much about
> LOOM) seem to be virtual memory for objects. A way to expand the size of
> memory. I'm talking about an object database built with virtual memory
> ideas. I know databases are ways to expand the size of memory but I'm
> looking at their persistence feature and not making memory look bigger.

Yes, I know, but you will find that you will face many of the same issues that OOZE and LOOM dealt with. I actually implemented a system much like what you are describing in Squeak a number of years ago. I used BerkeleyDB as my object storage. It was possible to connect multiple Squeak processes to a common database. There was a transactional system that let me track changes to disk-based objects and commit them. You could work with disk-based objects transparently.

Working with the Squeak VM was challenging in the sense that it is all very highly tuned for optimal memory use, fast GC, etc. I had to perform a lot of system tracing to transform Squeak images to my new object memory layout, etc. To fault in objects quickly, I had to implement a fast become operation. Since Squeak has no object table, I implemented a forwarder capability that would transform any object into a forwarder to another object by setting a header bit, then using the class pointer to point to the target object (which then necessitated doing away with the compact class header format). GC would sweep the forwarders away when it ran. IIRC, I managed to do this with something like a 10% performance and memory hit.
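
A toy model of the forwarder trick, in Python for illustration only. The names Obj, become_forward and follow are made up for this sketch and say nothing about how the actual implementation was written; the point is just that a header flag plus a reused class pointer is enough to redirect every reference to the real object.

```python
class Obj:
    def __init__(self, klass, slots=None):
        self.is_forwarder = False  # stands in for the header bit
        self.klass = klass         # class pointer; reused as the target when forwarding
        self.slots = slots or []


def become_forward(obj, target):
    """Turn obj into a forwarder to target without touching anything that refers to obj."""
    obj.is_forwarder = True
    obj.klass = target
    obj.slots = []


def follow(obj):
    """Chase forwarders until a real object is reached."""
    while obj.is_forwarder:
        obj = obj.klass
    return obj
```

In this model a stub standing in for a disk-based object can be forwarded to the freshly loaded object in constant time, and a later GC pass could replace references to the stub with references to the target.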

I got to a point where I realized I needed to also be able to persist classes and move them among different Squeak images that might have different versions of like-named classes and so forth (so you get into namespace issues). I eventually ran out of steam and abandoned the project. Croquet was also just starting up at the time, so I felt they would eventually solve many of these issues.

With that experience, I now believe you really need a new language (one that deals with namespace and security issues, à la Newspeak) and COLA-like (VPRI research) VM architectures that are easily customized to explore things like this...I'm hoping such things are not that far off.

- Stephen

Louis LaBrunda

Re: A Smalltalk object database idea

In reply to this post by Stephen Pair

Hi Jecel,

>Stephen Pair wrote:
>> OOZE and LOOM by Ted Kaehler, et al did this kind of thing. Here's a link to the 1981 article on OOZE: http://www-cs-students.stanford.edu/~eswierk/misc/kaehler81/
>> It mentions LOOM, but doesn't go into detail...I think the more detailed LOOM paper(s) are in the ACM digital library.
>
>There are some papers, but chapter 14 of the "green book" is probably
>the best place to learn about LOOM. The book is available at
>
>http://stephane.ducasse.free.fr/FreeBooks/BitsOfHistory/
>
>Though LOOM (and OOZE) are actually virtual memory systems rather than
>databases, if you don't define what you mean by "database" then they are
>probably good enough for most uses. Going more in the direction of
>industry standard databases, Gemstone is a great example of what can be
>done in Smalltalk. This paper (which I can't read right now) probably
>has some information about it:
>http://portal.acm.org/citation.cfm?id=125223.125254
>-- Jecel

I don't know a lot about Gemstone or how it is implemented but an object
database is what I am trying to achieve. In VA Smalltalk there is Voss
from Logicarts http://voss.logicarts.com/ and Tenacity from TotallyObjects
http://www.totallyobjects.com/tenacity.htm. I think both are very good,
especially Voss. Both are written in Smalltalk without modification of the
VM. I think both use proxy objects to link the object in memory with the
database. I believe they save/read objects to/from the database (made up
of very many small files) with an object dumper/loader.

My idea (if it can work) uses one file (or at least very few) and doesn't
use an object dumper/loader. I think this may make things faster and
simpler. By simpler, I mean much less Smalltalk code, no proxy objects,
easier backup of the database (since it is just one file). The lack of use
of the object dumper/loader may require more work if an object definition
changes.

Lou

David T. Lewis

Re: Re: A Smalltalk object database idea

On Tue, Dec 29, 2009 at 02:33:36PM -0500, Louis LaBrunda wrote:
>
> I don't know a lot about Gemstone or how it is implemented but an object
> database is what I am trying to achieve. In VA Smalltalk there is Voss
> from Logicarts http://voss.logicarts.com/ and Tenacity from TotallyObjects
> http://www.totallyobjects.com/tenacity.htm. I think both are very good,
> especially Voss. Both are written in Smalltalk without modification of the
> VM. I think both use proxy objects to link the object in memory with the
> database. I believe they save/read objects to/from the database (made up
> of very many small files) with an object dumper/loader.

You will also want to have a look at Magma, which is written in Squeak:
http://wiki.squeak.org/squeak/2665

Dave