Island snapshots

Peter Moore-5
What are the "rules" regarding when it is safe to take a snapshot of  
an island? I'd like to be able to save islands (or in this case  
specifically croquet spaces) on demand – via a menu item – or  
periodically as a backup mechanism for persistent spaces. Saving the
island on demand has been working for a while now without any
problems, but maybe I've just been getting lucky? I've added a new
method to the harness that returns the island data:

activeIslandData

        ^ (self activeSpace island get: #controller) snapshot

This was hooked up to a menu item and the data is stored in a file.  
No problems. Next I created a new process that calls this method
every 30 seconds (just for testing purposes). Eventually I get a
Checkpoint Failure. After some debugging I see that after
IslandWriter>>storeSegmentFor:into:outPointers: is called, there is
something that isn't filtered out. In this case it was a
TObjectID that mapped to a TAvatarReplica. I won't pretend I fully
understand what is happening during the snapshot process so that is  
about the best I can do in describing the situation (at one point I  
tried to call TIslandController>>snapshot from the space itself which  
really didn't work).

I've looked at what Wisconsin is doing with WiscWorlds. Basically,
they send requests for a snapshot of the island through the router.
Is this really necessary? If so can someone explain to me the reason  
why? This approach requires (I believe) adding to the facet list on  
both the controller and router. Also it seems some class methods had  
to be modified. Do you really need to do all of this to safely  
snapshot an island?

-Peter



Re: Island snapshots

Joshua Gargus-2

On Aug 28, 2006, at 4:14 PM, Peter Moore wrote:

> What are the "rules" regarding when it is safe to take a snapshot  
> of an island? I'd like to be able to save islands (or in this case  
> specifically croquet spaces) on demand – via a menu item – or  
> periodically as a backup mechanism for persistent spaces. Saving  
> the island on demand has been working now for awhile without any  
> problems, but maybe I've just been getting lucky? I've added a new  
> method to the harness that returns the island data:
>
> activeIslandData
>
> ^ (self activeSpace island get: #controller) snapshot
>
> This was hooked up to a menu item and the data is stored in a file.  
> No problems. Next I created a new process that would call this  
> method every 30 seconds (just for testing purposes). Eventually I  
> will get a Checkpoint Failure. After some debugging I see that  
> after IslandWriter>>storeSegmentFor:into:outPointers: is called  
> that there will be something that isn't filtered out. In this case  
> it was a TObjectID that mapped to a TAvatarReplica. I won't pretend  
> I fully understand what is happening during the snapshot process so  
> that is about the best I can do in describing the situation (at one  
> point I tried to call TIslandController>>snapshot from the space  
> itself which really didn't work).
>
> I've looked at what Wisconsin is doing with WiscWorlds, basically  
> they send requests for a snapshot of the island through the router.  
> Is this really necessary? If so can someone explain to me the  
> reason why? This approach requires (I believe) adding to the facet  
> list on both the controller and router. Also it seems some class  
> methods had to be modified. Do you really need to do all of this to  
> safely snapshot an island?
>

I'm not sure that what we did isn't a bit of overkill.  It seems to  
me that we could achieve our goal just as safely by directly adding  
our request to the controller's 'eventQueue'.

The reason we did this is because we tried doing the same thing that  
you are, and ran into the same problems.  I don't remember now if we
involved the router for a good reason, or because we missed the obvious.

Josh


Re: Island snapshots

Andreas.Raab
In reply to this post by Peter Moore-5
Peter Moore wrote:
> This was hooked up to a menu item and the data is stored in a file. No
> problems. Next I created a new process that would call this method every
> 30 seconds (just for testing purposes). Eventually I will get a
> Checkpoint Failure. After some debugging I see that after
> IslandWriter>>storeSegmentFor:into:outPointers: is called that there
> will be something that isn't filtered out. In this case it was a
> TObjectID that mapped to a TAvatarReplica.

You should have been presented with a workspace full of information
about the problem. Do you still have that info (or can you reproduce it)?
I'd be interested in seeing where the snapshot process fails - in theory
it should be possible to run it from anywhere except inside the island
you're snapshotting, so I think this is a bug.

> I've looked at what Wisconsin is doing with WiscWorlds, basically they
> send requests for a snapshot of the island through the router. Is this
> really necessary? If so can someone explain to me the reason why? This
> approach requires (I believe) adding to the facet list on both the
> controller and router. Also it seems some class methods had to be
> modified. Do you really need to do all of this to safely snapshot an
> island?

Seems excessive.

Cheers,
   - Andreas

Re: Island snapshots

Andreas.Raab
In reply to this post by Peter Moore-5
Peter Moore wrote:
> Hey Andreas,
>
> I've attached the rather large SnapshotTracer log as a text file. In the
> future, what should I be looking for? Thanks for your help.

The key observation is here:

        root: Smalltalk specialObjects (Array)
        4: Association
        value: ProcessorScheduler
        quiescentProcessLists: Array
        40: LinkedList
        firstLink: Process
        suspendedContext: MethodContext
        receiver: TAvatarReplicaMotion

This tells us that you were trying to execute a snapshot from a
concurrent process while some other process was executing in the island
(we see this in the above since the receiver is an avatar replica
belonging to the island).

The bottom line is that it's indeed a bug - we probably need
to lock both controller and island in the snapshot.

Cheers,
   - Andreas

Re: Island snapshots

Howard Stearns
In reply to this post by Peter Moore-5
In our experience, I think there were a couple of issues, all based
on having the snapshot occur during island rendering, which is
on-island.

1. Tweak scripts run at a high priority, so Tweak menus can happen  
during rendering.
Our first attempt to deal with this was to have the Tweak menu action  
fork off a separate process:
     [controller doTheSnapshotThing] forkAt: Processor userSchedulingPriority + 2.

2. Can keyboard and pointer events interrupt rendering? Especially  
considering that the handlers for these events might unintentionally  
suspend. (At the time, we didn't know, e.g., about Transcript doing  
refreshWorld, which may not be an example of this, but shows the kind  
of thing that can happen. See my 2006-08-14 message.)

The fork hack helped quite a bit, but we did still occasionally run  
into problems. The only sure-fire way to avoid them was to have the  
controller doTheSnapshotThing as part of its message processing.    
Josh mentioned that this could maybe have been done with something like:
     controller eventQueue nextPut: (MessageSend receiver: controller selector: #doTheSnapshotThing)
instead of making a round trip to the router.  I can't remember why I
thought I had trouble inserting stuff into the queue -- so go ahead
and try it!  I think maybe it had something to do with getting it  
inserted in the proper sequence so that the frozen island time  
included any messages that would eventually be timestamped by the  
router before the snapshot time.  For example, the #nextPut:, above,  
is backwards. It would have to be inserted at the other end.  But  
give it a shot.

I think that locking both controller and island during the snapshot  
won't fix these problems, unless rendering locks on the same objects.  
But that would be equivalent to doing the snapshot through the  
controller queue  (which seems to me to be a more elegant way of  
"locking").




Re: Island snapshots

Peter Moore-5
In reply to this post by Peter Moore-5
FWIW - The checkpoint failures have been occurring even when I'm not  
interacting with the croquet window. In fact it usually happens when  
I'm not even working in squeak. I started the "auto-save" (which  
forks a new squeak process which does the checkpointing) and let it  
run while I did other stuff. So I think that should rule out any  
keyboard or pointer events or Tweak menus.

Are there any potential problems with manipulating the controller's
eventQueue directly? I'm assuming that this is where the queued up  
messages from the router are stored. Will messing with it affect  
synchronization? What is the magic that is happening in the event  
loop that solves our problem?



Re: Island snapshots

Howard Stearns
In reply to this post by Peter Moore-5
Sorry, I should have been clearer, but I don't understand it well enough
yet. (And so I get verbose. But hey, I'm trying to talk my way through
to understanding it.) I'm speculating that the fundamental issue is that
an island process is getting interleaved with a non-island process.
   By island process, I mean something that is intended to be executed
with the current island for the process bound to the island being
snapshotted.
   By non-island process, I mean something that is intended to be
executed with anything else (typically island default, aka, Squeak or
"the ocean") as the current island.

An island process should not manipulate a non-island object directly,
and vice versa. But most objects don't actually know what island they
are supposed to be part of.  As I understand it, "island discipline" is
actually implemented as, or depends on, getting the "process discipline"
right.

Now, the participant and harness have a very nice set of processes that
gets everything right. If you only execute island stuff by handling a
message which the island controller told you to handle, then everything
should work right.  But what happens when this machinery is correctly
processing a controller's messages -- which cause the on-island stuff to
happen in the right context -- and it either gets pre-empted by a higher
priority off-island process, or it yields to some other off-island
process that was already running?  Answer: I don't know! But my claim is
that this can lead to an off-island process running with the island as its
current island, or vice versa.

*1* This can be a problem if new objects get created by the wrong
process, and therefore are not in the right island. It causes far refs
to appear that should be near refs. (The other mail message I cited,
below.)

*2* But quite often, I THINK such interleaving ends up being harmless --
UNLESS you happen to take your snapshot while they are interleaved.
Snapshotting is an off-island process that must not happen while in the
middle of running island stuff.  If you did, then there could be island
objects directly on the stack (not guarded by far refs) or in other
places that are traceable from the ocean roots.  If these processes
interleave while you're not snapshotting -- who cares?  But if you are
snapshotting, the machinery will see this and barf.  *** Note that the
handling of Andreas' sync message from the router, and the handling of
my caching message from the router, are both handled by the controller
directly. No on-island objects are referenced! And what's more, since
only one island message is processed at a time by the controller, and
all run until completion before the next is started, we know that there
are no island objects being manipulated at the time the #snapshot is
taken.  That's why it's sure to be safe.  (The problem from *1*, above,
can still occur BEFORE the snapshot is ever asked for. Such bugs have to
be hunted down and killed.)
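
The run-to-completion argument above can be sketched in plain Squeak. This is only a toy model under stated assumptions: it uses no Croquet classes, the names (queue, islandState, snapshots) are made up for illustration, and the "snapshot" is just a copy of a collection. It shows why a snapshot that travels through the same queue as the island messages can only run between two of them, never in the middle of one.

```smalltalk
"Toy model in plain Squeak; no Croquet classes, all names are illustrative."
| queue islandState snapshots worker done |
queue := SharedQueue new.
islandState := OrderedCollection new.   "stands in for on-island objects"
snapshots := OrderedCollection new.
done := Semaphore new.

"A single consumer takes one message at a time and runs it to completion."
worker := [
    | msg |
    [(msg := queue next) == #stop] whileFalse: [msg value].
    done signal] fork.

"Ordinary island work."
queue nextPut: [islandState add: 1; add: 2].
"The snapshot is just another queued message."
queue nextPut: [snapshots add: islandState copy].
queue nextPut: [islandState add: 3].
queue nextPut: #stop.

done wait.
snapshots first   "holds 1 and 2, but not 3"
```

Because only the worker process ever touches islandState, there is no moment at which the copy can see a half-applied message. The real controller is of course more involved; this only illustrates the ordering argument.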

Peter Moore wrote:
> FWIW - The checkpoint failures have been occurring even when I'm not
> interacting with the croquet window. In fact it usually happens when
> I'm not even working in squeak. I started the "auto-save" (which forks
> a new squeak process which does the checkpointing) and let it run
> while I did other stuff. So I think that should rule out any keyboard
> or pointer events or Tweak menus.
So, while it may not be Croquet or Tweak events that are getting
interleaved with island processing, there may well be other processes
that are getting interleaved.  How can we tell if this "auto save"
process is getting interleaved?  Also, is your unattended squeak still
rendering? Rendering includes both off-island rendering (of the morph
and of tweak) and on-island rendering (of each visible island).  If
these are getting interrupted or yielding, how do we know the snapshot
isn't occurring at the same time?  (That's a real question. Not rhetorical.)

>
> Is there any potential problems with manipulating the controller's
> eventQueue directly? I'm assuming that this is where the queued up
> messages from the router are stored. Will messing with it affect
> synchronization?
Yes, it can. A message could be on its way to you from the router,
timestamped BEFORE some item (e.g., an internal future message from a
simulation) that you already have on your queue. The existing controller
machinery sorts the messages by timestamp when inserting them.

When you take your snapshot, it will record the island time and your
pending queue. If the message to take the snapshot came from the router
(as ours did, and as the built-in sync message does), then we know that it
will preserve the same order that everyone else gets.

But if you just add the message to the queue yourself, you'll have to
think about where it goes, and at what timestamp. That's what made my
brain hurt and so I let the router figure it out.
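
The ordering concern can be sketched with a timestamp-sorted collection in plain Squeak. This is a hypothetical illustration, not the controller's actual queue: messages are modelled as (timestamp -> label) associations, and the timestamps and labels are invented.

```smalltalk
"Toy model: each message is a (timestamp -> label) association. Names and numbers are made up."
| pending |
pending := SortedCollection sortBlock: [:a :b | a key <= b key].
pending add: 120 -> #futureFromSimulation.   "already queued locally"
pending add: 110 -> #snapshotRequest.        "inserted by hand at a chosen stamp"
pending add: 100 -> #messageFromRouter.      "arrives later, but is stamped earlier"
(pending collect: [:each | each value]) asArray
    "labels in timestamp order: messageFromRouter, snapshotRequest, futureFromSimulation"
```

If the hand-inserted snapshot request had already been executed before the message stamped 100 arrived, the saved island would be missing a message that everyone else processed first. Letting the router stamp the request avoids having to reason about that.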

> What is the magic that is happening in the event loop that solves our
> problem?

See ***, above.


--
Howard Stearns
University of Wisconsin - Madison
Division of Information Technology
voice:+1-608-262-3724


Re: Island snapshots

Andreas.Raab
In reply to this post by Peter Moore-5
Hi Howard -

Howard Stearns wrote:
> Sorry, I should have been clearer, but I don't understand it well enough
> yet. (And so I get verbose. But hey, I'm trying to talk my way through
> to understanding it.) I'm speculating that the fundamental issue is that
> an island process is getting interleaved with non-island process.

Actually there is no need to speculate. The problem comes from the
fact that syncSend:s don't lock the island. You do this from multiple
processes and you are hosed. In a later version I have already fixed
that, but that version is a little too unstable right now for broader
consumption.
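
As a rough illustration of what "locking the island" means in general, here is a minimal sketch using a plain Squeak mutex. It is not the fix referred to above and uses no Croquet API; islandLock and islandState are hypothetical names, and the snapshot is modelled as a simple copy.

```smalltalk
"Toy model of guarding island access with one mutex; not Croquet's actual fix."
| islandLock islandState snapshot |
islandLock := Semaphore forMutex.
islandState := OrderedCollection new.

"Every process that touches the island does so inside the critical section..."
[islandLock critical: [islandState add: 1; add: 2]] fork.

"...and so does the snapshot, so it never observes a half-finished update."
snapshot := islandLock critical: [islandState copy].
```

Whichever process gets the lock first, the copy contains either none or both of the additions. The scheme only works if every process that reaches into the island takes the same lock.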

Cheers,
   - Andreas