[OT] What is this called?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

[OT] What is this called?

Colin Putney-3
Hi folks,

Sorry for the off-topic post. I'm posting it here because I know there are lots of high-powered comp-sci folks lurking, and I hope to benefit from their wisdom.

I have in mind a simple problem. Imagine we have a data structure in the memory of some program. The exact structure it has doesn't matter, but does have structure; it's not just a buffer full of bytes. For argument's sake, we'll assume it's a tree.

Our task it to copy this tree. We want to make another tree in memory that is logically equivalent to the first. This is pretty straightforward. We need to walk the tree, allocating new nodes and copying over any internal values they contain, then recursing into the children. So far so good. But what if we generalize the problem?

One way to do that would be to make the copy more distant from the original. We could copy into a different address space, in another process or on another machine. We could make the copy more distant in time. Perhaps we need to reclaim the memory that our tree uses, and the reconstruct it later. 

This means we'll have to do IO of some kind. The simplest adaptation of our tree-walking algorithm might be to allocate space for the nodes on disk, rather than in memory. This is really simple if we have access to raw disk storage, but gets a little more complicated (and slooow) if we're using files in a file system. We can gain back some simplicity by putting all the nodes in one file, and having some kind of internal structure to the file that lets us recover the nodes and links between them. We could also allocate storage space via communicating with another process. That might be over a network, or using some kind or IPC mechanism supplied by the operating system. Once again we'll need to transmit the data within the nodes, along with some metadata describing the connections between them. 

We can also loosen our definition of "logically equivalent". We might want to construct an equivalent tree in a process running a different version of our program. Or another program entirely, potentially written in another programming language. This forces us to raise the level of abstraction. We can't just copy each node as a blob of raw memory, and assume that the "receiving" program will know how to interpret it. We need some semantic definitions agreed upon by the two programs, and some means of representing those semantics as byte sequences that can be copied between them.

Now, this is not an intractable problem. We do it all the time. In fact, I'd say a large fraction of the code I've written in my career has been a solution to some version of this problem. I'm sure many of you can relate. :-)

So what is this problem called? What theory describes the possible solutions? Are there classes of solutions that have similar trade-offs? Where can I learn more?

Colin


Reply | Threaded
Open this post in threaded view
|

RE: [OT] What is this called?

Ron Teitelbaum

Hi Colin,

 

I think you have mentioned a number of parts of the problem already.  Schema Management, Address Space Management, and Version Management. 

 

Schema management gives you a protocol that allows you to define the source and destination schema along with any information that might be useful for transferring the data to a new environment.

 

Address Space Management allows you to move data from one location to another while maintaining links and data integrity.  What I think of here is that as long as the address space is well defined (move all data from UID to Header+UID) the links can maintain themselves.  It’s also possible to allow nodes to move and manage data instead of having one big process that does it for you.  For example say you have an address space that separates nodes into general address spaces instead of hard addresses.  This would allow you to duplicate data to nodes that grab the closest data to their own address.  Finding data in that structure would be looking for data next to addresses that match stored data index of some kind.  See Distributed Hash Tables for more info.  What’s nice about this method is that the actual structure is not important, nodes can be deleted without losing data, and nodes can maintain their own address space and storage and heal themselves. 

 

Version Management is good for adding links to data that point to versions of objects and allow you to find the latest version.  You could have maps of versions that give you a point in time version of data.  Much like Monticello Configurations. 

 

Combining them all is an interesting idea.  A query of data might first find the right version in time, locate the right data, and then apply the right transformation to give you an answer.  Copying may select just a single version, apply a transformation, and then copy the data to a new address space.

 

That’s my take at least.

 

All the best,

 

Ron Teitelbaum

Head Of Engineering

3D Immersive Collaboration Consulting

[hidden email]

Follow Me On Twitter: @RonTeitelbaum

www.3Dicc.com

https://www.google.com/+3Dicc

 

 

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Colin Putney
Sent: Tuesday, December 29, 2015 3:47 PM
To: The general-purpose Squeak developers list
Subject: [squeak-dev] [OT] What is this called?

 

Hi folks,

 

Sorry for the off-topic post. I'm posting it here because I know there are lots of high-powered comp-sci folks lurking, and I hope to benefit from their wisdom.

 

I have in mind a simple problem. Imagine we have a data structure in the memory of some program. The exact structure it has doesn't matter, but does have structure; it's not just a buffer full of bytes. For argument's sake, we'll assume it's a tree.

 

Our task it to copy this tree. We want to make another tree in memory that is logically equivalent to the first. This is pretty straightforward. We need to walk the tree, allocating new nodes and copying over any internal values they contain, then recursing into the children. So far so good. But what if we generalize the problem?

 

One way to do that would be to make the copy more distant from the original. We could copy into a different address space, in another process or on another machine. We could make the copy more distant in time. Perhaps we need to reclaim the memory that our tree uses, and the reconstruct it later. 

 

This means we'll have to do IO of some kind. The simplest adaptation of our tree-walking algorithm might be to allocate space for the nodes on disk, rather than in memory. This is really simple if we have access to raw disk storage, but gets a little more complicated (and slooow) if we're using files in a file system. We can gain back some simplicity by putting all the nodes in one file, and having some kind of internal structure to the file that lets us recover the nodes and links between them. We could also allocate storage space via communicating with another process. That might be over a network, or using some kind or IPC mechanism supplied by the operating system. Once again we'll need to transmit the data within the nodes, along with some metadata describing the connections between them. 

 

We can also loosen our definition of "logically equivalent". We might want to construct an equivalent tree in a process running a different version of our program. Or another program entirely, potentially written in another programming language. This forces us to raise the level of abstraction. We can't just copy each node as a blob of raw memory, and assume that the "receiving" program will know how to interpret it. We need some semantic definitions agreed upon by the two programs, and some means of representing those semantics as byte sequences that can be copied between them.

 

Now, this is not an intractable problem. We do it all the time. In fact, I'd say a large fraction of the code I've written in my career has been a solution to some version of this problem. I'm sure many of you can relate. :-)

 

So what is this problem called? What theory describes the possible solutions? Are there classes of solutions that have similar trade-offs? Where can I learn more?

 

Colin



Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Chris Muller-3
In reply to this post by Colin Putney-3
> So what is this problem called?

I refer to them as the problems of "transparent persistence" and
"normalization".

> What theory describes the possible
> solutions?

I'm not sure what theories, but database implementations address these
issues in practice.  Your were describing Magma exactly up until the
"loosen our definition" paragraph, however, there's no reason the
other program could not be coded to access the Magma DB or files
directly -- the complete meta-definitions are included as part of the
connection process.

> Are there classes of solutions that have similar trade-offs?
> Where can I learn more?

But, obviously you know about ODBMS's, so probably this isn't the
answer you're looking for.

Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Huw Lloyd-2
Hi Colin,

To the degree that your conception is object-oriented, in the deeper sense, it is not a problem.  Your object(s) know how to persist themselves and may be considered to be extensive over that persistence -- it is merely that their morphology has changed.

Are you trying to identify names in a known problem space (e.g. a space defined by the limits of the techniques used) or to identify the problem space that you're in?  For the latter, if you're going to generalise then perhaps you should also wave the specialisation of 'perfect' reproduction.  So, even more generally, what you seem to describing is a special case of morphogenesis in which a 'copy' is made, e.g. lazy evaluation and cached computation can be thought of as slices in that process.

Best,
Huw




On 29 December 2015 at 21:32, Chris Muller <[hidden email]> wrote:
> So what is this problem called?

I refer to them as the problems of "transparent persistence" and
"normalization".

> What theory describes the possible
> solutions?

I'm not sure what theories, but database implementations address these
issues in practice.  Your were describing Magma exactly up until the
"loosen our definition" paragraph, however, there's no reason the
other program could not be coded to access the Magma DB or files
directly -- the complete meta-definitions are included as part of the
connection process.

> Are there classes of solutions that have similar trade-offs?
> Where can I learn more?

But, obviously you know about ODBMS's, so probably this isn't the
answer you're looking for.




Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Colin Putney-3
In reply to this post by Chris Muller-3


On Tue, Dec 29, 2015 at 1:32 PM, Chris Muller <[hidden email]> wrote:
 
But, obviously you know about ODBMS's, so probably this isn't the
answer you're looking for.

Well, it's the start of an answer. Sure, I know about ODBMSs, they're one solution to this problem. I can copy my data structure (objects, in the case of Squeak/Magma) to the database, then read it back later. But how does that actually work? 

Since the database is running in a separate process, you have to serialize the objects somehow, and transmit them somehow, and when they're read back into a different image, you have to detect and accommodate missing classes and shape changes and so on. I'm sure you've implemented all this in a way that's sensible for the intended uses of Magma. If we compare your solution to say, Gemstone, I'm sure you can talk about this feature or that tradeoff that's different between the two implementations. My question is, despite those differences, are the two solutions essentially the same, or are the differences fundamental? How about if we compare with Croquet? Or CORBA? Or JSON?

I'm hoping this is something that people study, like oh, graph theory or category theory or parsing. 

Thanks!




Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Colin Putney-3
In reply to this post by Ron Teitelbaum


On Tue, Dec 29, 2015 at 1:31 PM, Ron Teitelbaum <[hidden email]> wrote:

Hi Colin,

 

I think you have mentioned a number of parts of the problem already.  Schema Management, Address Space Management, and Version Management. 


That sounds interesting. Where do these terms come from?

Colin 



Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Phil B
In reply to this post by Colin Putney-3
On Tue, 2015-12-29 at 12:46 -0800, Colin Putney wrote:
>
> So what is this problem called? What theory describes the possible
> solutions? Are there classes of solutions that have similar trade-
> offs? Where can I learn more?
>

Serialization/marshalling? https://en.wikipedia.org/wiki/Serialization

I think of it as serialization if the object is 'dead' when it reaches
its destination, marshalling if I need it 'alive'

Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Colin Putney-3
In reply to this post by Huw Lloyd-2


On Tue, Dec 29, 2015 at 3:27 PM, Huw Lloyd <[hidden email]> wrote:
 
To the degree that your conception is object-oriented, in the deeper sense, it is not a problem.  Your object(s) know how to persist themselves and may be considered to be extensive over that persistence -- it is merely that their morphology has changed.

Sure, but that's just hiding the issue behind an abstraction. I'm interested in the mechanics of the change in morphology. If the objects can persist themselves, great, but what technique do they use? 

Are you trying to identify names in a known problem space (e.g. a space defined by the limits of the techniques used) or to identify the problem space that you're in?  For the latter, if you're going to generalise then perhaps you should also wave the specialisation of 'perfect' reproduction.  So, even more generally, what you seem to describing is a special case of morphogenesis in which a 'copy' is made, e.g. lazy evaluation and cached computation can be thought of as slices in that process.

 I think I'm trying to identify the problem space. "Morphogenesis" looks like an interesting term. Lazy evaluation and cached computation seem like opposites sides of the same coin. Do you mean that they are another form of "copying" that might be considered? I suppose that opens up new methods for performing the copy - eg, if the data-structure we're copying is a cached computation, instead of transmitting a description of the data structure, we might transmit the computation that produced it. Of course, then we still have the problem of how to accomplish *that*.

Colin


Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Colin Putney-3
In reply to this post by Phil B


On Tue, Dec 29, 2015 at 11:05 PM, Phil (list) <[hidden email]> wrote:
 
Serialization/marshalling? https://en.wikipedia.org/wiki/Serialization

I think of it as serialization if the object is 'dead' when it reaches
its destination, marshalling if I need it 'alive'

Interesting, thanks. What's the difference between dead and alive?

 


Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Phil B
On Tue, 2015-12-29 at 23:10 -0800, Colin Putney wrote:

>
>
> On Tue, Dec 29, 2015 at 11:05 PM, Phil (list) <[hidden email]>
> wrote:
>  
> > Serialization/marshalling? https://en.wikipedia.org/wiki/Serializat
> > ion
> >
> > I think of it as serialization if the object is 'dead' when it
> > reaches
> > its destination, marshalling if I need it 'alive'
> Interesting, thanks. What's the difference between dead and alive?
>
>  

Dead meaning you just want to persist/transfer the object state (i.e.
write to disk etc.), alive meaning you want an actual instance with the
specified object state to exist when it reaches its destination (i.e.
recreate the object at the destination whether it be in another
instance on the same machine, across the network, or even in an
entirely different language/runtime environment)

Reply | Threaded
Open this post in threaded view
|

Re: [OT] What is this called?

Huw Lloyd-2
In reply to this post by Colin Putney-3


On 30 December 2015 at 07:07, Colin Putney <[hidden email]> wrote:


On Tue, Dec 29, 2015 at 3:27 PM, Huw Lloyd <[hidden email]> wrote:
 
To the degree that your conception is object-oriented, in the deeper sense, it is not a problem.  Your object(s) know how to persist themselves and may be considered to be extensive over that persistence -- it is merely that their morphology has changed.

Sure, but that's just hiding the issue behind an abstraction. I'm interested in the mechanics of the change in morphology. If the objects can persist themselves, great, but what technique do they use? 

I thought that handling a polymorphic structure was half of your problem, Colin?  This is one of those cases where this mechanism can be delegated to each successive object.  

Are you trying to identify names in a known problem space (e.g. a space defined by the limits of the techniques used) or to identify the problem space that you're in?  For the latter, if you're going to generalise then perhaps you should also wave the specialisation of 'perfect' reproduction.  So, even more generally, what you seem to describing is a special case of morphogenesis in which a 'copy' is made, e.g. lazy evaluation and cached computation can be thought of as slices in that process.

 I think I'm trying to identify the problem space. "Morphogenesis" looks like an interesting term. Lazy evaluation and cached computation seem like opposites sides of the same coin. Do you mean that they are another form of "copying" that might be considered? I suppose that opens up new methods for performing the copy - eg, if the data-structure we're copying is a cached computation, instead of transmitting a description of the data structure, we might transmit the computation that produced it

Yes, or the 'transmit' might be pure calculation.  Really, the storing and sending of values is a special case (short-cut) of a process.

See SICP on streams:
 
. Of course, then we still have the problem of how to accomplish *that*.

Quite so, but that is something that can be localised too, if it was desirable.  There's no mechanistic reason why it has to be the same across objects.

I vaguely recall that William Kent's "Data and Reality" has some of these ideas in the form of entity - object relations which might help you too.

Best,
Huw



Colin






Reply | Threaded
Open this post in threaded view
|

RE: [OT] What is this called?

Ron Teitelbaum
In reply to this post by Colin Putney-3

 

 

 

On Tue, Dec 29, 2015 at 1:31 PM, Ron Teitelbaum <[hidden email]> wrote:

Hi Colin,

I think you have mentioned a number of parts of the problem already.  Schema Management, Address Space Management, and Version Management. 

 

That sounds interesting. Where do these terms come from?

 

From me but I didn’t make them up.  I’ve worked with a number of databases.  When I started I wrote my own data processing programs in a spreadsheet environment called Symphony.  This was just data and functions to manipulate data.  Later I worked with dBase and Paradox.  These added the concept of a defined Schema Management and indexes.  I then moved to Oracle that added the concept of versions, locking and logging, (the locking and logging part is Version Management (adding lock then commit or rollback).  Versant the object database gave me the concepts of Address Spaces and finally DHT added the details behind what Address Spaces could do.   

 

When dealing with Schema Management I developed a the method to manage upgrading a database where I added a db upgrade table.  It contained the information needed to move the database forward or backwards.  Basically I added a version number and a row for each table that changed, the method to change the schema, and the transforms needed to move or convert the data.  Clients would connect and verify the schema version from that table and could trigger an upgrade with the proper permissions or error if the version didn’t match.  The table is a way to automate what would have been a manual process (run this query to update the table definition, then run these queries to transform the data, now call yourself this version).  We also added the reverse to go back a version.  This was really just a failsafe to allow you to rollback and recover from an upgrade error.  Having to do both caused some interesting discussions.  Data that could have been lost, which was no longer needed or collected in a newer version of the system could cause issues when rolling back, this would usually require developers to add default data, “no data collected”, or something of that type.  It made the process more thoughtful and helped to protect the system. 

 

The idea of a more detailed Version Management that gives you point in time is really just a DB design issue.  Linked versions of objects can be traversed for versions and index snapshots of a point in time are easy to design in if the data is important.

 

Thinking in terms of data is something that happens pretty naturally especially if you started working with data. I had a friend that always said that no good programmers started programming after 1985.  The reason for that was that before 1985 you had to work with data structures and you learned them.  I’m sure that is not entirely true but I am glad that I’ve been so exposed to data and can think in terms of data and it’s structure.  I developed a system that did some amazing data transformations (reading what a physician wrote and converting into what a pharmacist would dispense).  This sounds like an easy problem converting dose to strength but it was a very complicated problem.  Think in terms of Brand, Generic, Formularies and Inventory, drug-drug interactions and other health checks, the number of different sizes of pills and forms (tablet, gel caps), then add on automated packaging, billing, tracking and medical records.  It was quite a system.  When I was designing it, I kept getting the question when are you going to show us a screen for how this will work.  I said not until I have a data structure that can handle it.  I worked on the underlying data structure, then the programming to support it.  People were getting quite upset with me since you could not see anything that worked.  I knew it worked and could show them in code but that didn’t seem to be good enough.  We added the UI at the end and it took only a few days.  Had we started with the UI and gone from that to the database we may never have been able to solve the problem.  Real systems require detailed data structures all of these methods for managing them are part of that.

 

Does that help?

 

All the best,

 

Ron Teitelbaum

Head Of Engineering

3D Immersive Collaboration Consulting

[hidden email]

Follow Me On Twitter: @RonTeitelbaum

www.3Dicc.com

https://www.google.com/+3Dicc

 

 

 

Colin