What are the "rules" regarding when it is safe to take a snapshot of an island? I'd like to be able to save islands (or in this case specifically croquet spaces) on demand – via a menu item – or periodically as a backup mechanism for persistent spaces. Saving the island on demand has been working for a while now without any problems, but maybe I've just been getting lucky?

I've added a new method to the harness that returns the island data:

    activeIslandData
        ^ (self activeSpace island get: #controller) snapshot

This was hooked up to a menu item and the data is stored in a file. No problems. Next I created a new process that would call this method every 30 seconds (just for testing purposes). Eventually I get a Checkpoint Failure. After some debugging I see that after IslandWriter>>storeSegmentFor:into:outPointers: is called there is something that isn't filtered out. In this case it was a TObjectID that mapped to a TAvatarReplica. I won't pretend I fully understand what is happening during the snapshot process, so that is about the best I can do in describing the situation (at one point I tried to call TIslandController>>snapshot from the space itself, which really didn't work).

I've looked at what Wisconsin is doing with WiscWorlds; basically they send requests for a snapshot of the island through the router. Is this really necessary? If so, can someone explain to me the reason why? This approach requires (I believe) adding to the facet list on both the controller and router. Also it seems some class methods had to be modified. Do you really need to do all of this to safely snapshot an island?

-Peter
On Aug 28, 2006, at 4:14 PM, Peter Moore wrote:

> I've looked at what Wisconsin is doing with WiscWorlds, basically
> they send requests for a snapshot of the island through the router.
> Is this really necessary? If so can someone explain to me the
> reason why? This approach requires (I believe) adding to the facet
> list on both the controller and router. Also it seems some class
> methods had to be modified. Do you really need to do all of this to
> safely snapshot an island?

I'm not sure that what we did isn't a bit of overkill. It seems to me that we could achieve our goal just as safely by directly adding our request to the controller's 'eventQueue'.

We did this because we tried doing the same thing that you are, and ran into the same problems. I don't remember now if we involve the router for a good reason, or because we missed the obvious.

Josh
In reply to this post by Peter Moore-5
Peter Moore wrote:
> This was hooked up to a menu item and the data is stored in a file. No
> problems. Next I created a new process that would call this method every
> 30 seconds (just for testing purposes). Eventually I will get a
> Checkpoint Failure. After some debugging I see that after
> IslandWriter>>storeSegmentFor:into:outPointers: is called that there
> will be something that isn't filtered out. In this case it was a
> TObjectID that mapped to a TAvatarReplica.

You should have been presented with a workspace full of information about the problem. Do you still have that info (or can you reproduce it)? I'd be interested in seeing where the snapshot process fails - in theory it should be possible to run it from anywhere except inside the island you're snapshotting, so I think this is a bug.

> I've looked at what Wisconsin is doing with WiscWorlds, basically they
> send requests for a snapshot of the island through the router. Is this
> really necessary? If so can someone explain to me the reason why? This
> approach requires (I believe) adding to the facet list on both the
> controller and router. Also it seems some class methods had to be
> modified. Do you really need to do all of this to safely snapshot an
> island?

Seems excessive.

Cheers,
  - Andreas
In reply to this post by Peter Moore-5
Peter Moore wrote:
> Hey Andreas,
>
> I've attached the rather large SnapshotTracer log as a text file. In the
> future, what should I be looking for? Thanks for your help.

The key observation is here:

    root: Smalltalk specialObjects (Array)
    4: Association
    value: ProcessorScheduler
    quiescentProcessLists: Array
    40: LinkedList
    firstLink: Process
    suspendedContext: MethodContext
    receiver: TAvatarReplicaMotion

This tells us that you were trying to execute a snapshot from a concurrent process while some other process was executing in the island (we see this in the above since the receiver is an avatar replica belonging to the island).

The bottom line is that it is indeed a bug - we probably need to lock both controller and island in the snapshot.

Cheers,
  - Andreas
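Editor's sketch: the trace above reads as a chain of references from a system root down to an object owned by the island. The following toy model in Python may help in reading such traces. It is only an illustration under assumptions: `find_leak`, the `edges` table, and the `island` set are hypothetical names, and nothing here reflects how IslandWriter or SnapshotTracer is actually implemented. It simply searches from the roots and reports the first path that reaches an island-owned object.

```python
from collections import deque

def find_leak(roots, edges, island_objects):
    """Breadth-first search from the roots.

    Returns the first path (a list of node names) that reaches an object
    owned by the island, or None when no such path exists.
    """
    queue = deque([name] for name in roots)
    seen = set(roots)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in island_objects:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical reference graph, loosely shaped like the trace above:
# a suspended process still holds a context whose receiver is on-island.
edges = {
    "ProcessorScheduler": ["Process"],
    "Process": ["MethodContext"],
    "MethodContext": ["TAvatarReplicaMotion"],
}
island = {"TAvatarReplicaMotion"}

leak = find_leak(["ProcessorScheduler"], edges, island)

# Once the suspended context no longer references the island object,
# the same search finds nothing.
clean_edges = dict(edges, MethodContext=[])
no_leak = find_leak(["ProcessorScheduler"], clean_edges, island)
```

In this model, `leak` plays the role of the path printed in the trace, and `no_leak` corresponds to a snapshot taken while no process is suspended inside the island.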
In reply to this post by Peter Moore-5
In our experience, I think there were a couple of issues, all based on having the snapshot occur during island rendering, which is on-island.

1. Tweak scripts run at a high priority, so Tweak menus can happen during rendering. Our first attempt to deal with this was to have the Tweak menu action fork off a separate process:

    [controller doTheSnapshotThing] forkAt: Processor userSchedulingPriority + 2.

2. Can keyboard and pointer events interrupt rendering? Especially considering that the handlers for these events might unintentionally suspend. (At the time, we didn't know, e.g., about Transcript doing refreshWorld, which may not be an example of this, but shows the kind of thing that can happen. See my 2006-08-14 message.)

The fork hack helped quite a bit, but we did still occasionally run into problems. The only sure-fire way to avoid them was to have the controller doTheSnapshotThing as part of its message processing. Josh mentioned that this could maybe have been done with something like:

    controller eventQueue nextPut: (MessageSend receiver: controller selector: #doTheSnapshotThing)

instead of making a round trip to the router. I can't remember why I thought I had trouble inserting stuff into the queue -- so go ahead and try it! I think maybe it had something to do with getting it inserted in the proper sequence so that the frozen island time included any messages that would eventually be timestamped by the router before the snapshot time. For example, the #nextPut:, above, is backwards. It would have to be inserted at the other end. But give it a shot.

I think that locking both controller and island during the snapshot won't fix these problems, unless rendering locks on the same objects. But that would be equivalent to doing the snapshot through the controller queue (which seems to me to be a more elegant way of "locking").

On Aug 29, 2006, at 6:51 PM, Andreas Raab wrote:

> The bottom line being that indeed it's indeed a bug - we probably
> need to lock both controller and island in the snapshot.
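Editor's sketch: the idea of taking the snapshot "as part of message processing" can be modelled in a few lines of Python. This is a toy under assumptions, not Croquet code: `ToyController`, `enqueue`, `move`, and `run` are hypothetical names. The point it illustrates is that when every message runs to completion before the next one starts, a snapshot that is itself a message can only observe the state between two messages, never in the middle of one.

```python
from collections import deque

class ToyController:
    """Processes queued messages one at a time, each to completion."""

    def __init__(self):
        self.state = {"x": 0, "y": 0}
        self.queue = deque()
        self.snapshots = []

    def enqueue(self, message):
        # A message is any zero-argument callable.
        self.queue.append(message)

    def move(self, dx, dy):
        # Two separate updates: a snapshot taken between them would be
        # inconsistent, but the run loop never interleaves messages.
        self.state["x"] += dx
        self.state["y"] += dy

    def snapshot(self):
        # Copy the state so later messages cannot alter the snapshot.
        self.snapshots.append(dict(self.state))

    def run(self):
        while self.queue:
            message = self.queue.popleft()
            message()

controller = ToyController()
controller.enqueue(lambda: controller.move(1, 1))
controller.enqueue(controller.snapshot)
controller.enqueue(lambda: controller.move(2, 2))
controller.run()
```

After `run`, the single snapshot holds the state as it was after the first move and before the second, which is the property the queue-based approach is meant to provide.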
In reply to this post by Peter Moore-5
FWIW - The checkpoint failures have been occurring even when I'm not interacting with the croquet window. In fact it usually happens when I'm not even working in squeak. I started the "auto-save" (which forks a new squeak process which does the checkpointing) and let it run while I did other stuff. So I think that should rule out any keyboard or pointer events or Tweak menus.

Are there any potential problems with manipulating the controller's eventQueue directly? I'm assuming that this is where the queued up messages from the router are stored. Will messing with it affect synchronization?

What is the magic that is happening in the event loop that solves our problem?

On Aug 29, 2006, at 9:27 PM, Howard Stearns wrote:

> The fork hack helped quite a bit, but we did still occasionally run
> into problems. The only sure-fire way to avoid them was to have the
> controller doTheSnapshotThing as part of its message processing.
> Josh mentioned that this could maybe have been done with something
> like:
>     controller eventQueue nextPut: (MessageSend receiver:
>     controller selector: #doTheSnapshotThing)
> instead of making a round trip to the router.
In reply to this post by Peter Moore-5
Sorry, I should have been clearer, but I don't understand it well enough yet. (And so I get verbose. But hey, I'm trying to talk my way through to understanding it.)

I'm speculating that the fundamental issue is that an island process is getting interleaved with a non-island process. By island process, I mean something that is intended to be executed with the current island for the process bound to the island being snapshotted. By non-island process, I mean something that is intended to be executed with anything else (typically island default, aka Squeak or "the ocean") as the current island. An island process should not manipulate a non-island object directly, and vice versa. But most objects don't actually know what island they are supposed to be part of. As I understand it, "island discipline" is actually implemented as, or depends on, getting the "process discipline" right.

Now, the participant and harness have a very nice set of processes that gets everything right. If you only execute island stuff by handling a message which the island controller told you to handle, then everything should work right. But what happens when this machinery is correctly processing a controller's messages -- which cause the on-island stuff to happen in the right context -- and it either gets pre-empted by a higher priority off-island process, or it yields to some other off-island process that was already running? Answer: I don't know! But my claim is that this can lead to an off-island process running with the island as its current island, or vice versa.

*1* This can be a problem if new objects get created by the wrong process, and therefore are not in the right island. It causes far refs to appear that should be near refs. (The other mail message I cited, below.)

*2* But quite often, I THINK such interleaving ends up being harmless -- UNLESS you happen to take your snapshot while they are interleaved. Snapshotting is an off-island process that must not happen while in the middle of running island stuff. If you did, then there could be island objects directly on the stack (not guarded by far refs) or in other places that are traceable from the ocean roots. If these processes interleave while you're not snapshotting -- who cares? But if you are snapshotting, the machinery will see this and barf.

*** Note that the handling of Andreas' sync message from the router, and the handling of my caching message from the router, are both handled by the controller directly. No on-island objects are referenced! And what's more, since only one island message is processed at a time by the controller, and all run until completion before the next is started, we know that there are no island objects being manipulated at the time the #snapshot is taken. That's why it's sure to be safe. (The problem from *1*, above, can still occur BEFORE the snapshot is ever asked for. Such bugs have to be hunted down and killed.)

Peter Moore wrote:
> FWIW - The checkpoint failures have been occurring even when I'm not
> interacting with the croquet window. In fact it usually happens when
> I'm not even working in squeak. I started the "auto-save" (which forks
> a new squeak process which does the checkpointing) and let it run
> while I did other stuff. So I think that should rule out any keyboard
> or pointer events or Tweak menus.

So, while it may not be Croquet or Tweak events that are getting interleaved with island processing, there may well be other processes that are getting interleaved. How can we tell if this "auto save" process is getting interleaved? Also, is your unattended squeak still rendering? Rendering includes both off-island rendering (of the morph and of tweak) and on-island rendering (of each visible island). If these are getting interrupted or yielding, how do we know the snapshot isn't occurring at the same time? (That's a real question. Not rhetorical.)

> Is there any potential problems with manipulating the controller's
> eventQueue directly? I'm assuming that this is where the queued up
> messages from the router are stored. Will messing with it affect
> synchronization?

Yes, it can. A message could be on its way to you from the router, timestamped BEFORE some item (e.g., an internal future message from a simulation) that you already have on your queue. The existing controller machinery sorts the messages by timestamp when inserting messages. When you take your snapshot, it will record the island time and your pending queue. If the message to take the snapshot came from the router (as we did, and as the built-in sync message does), then we know that it will preserve the same order that everyone else gets. But if you just add the message to the queue yourself, you'll have to think about where it goes, and at what timestamp. That's what made my brain hurt and so I let the router figure it out.

> What is the magic that is happening in the event loop that solves our
> problem?

See ***, above.

> On Aug 29, 2006, at 9:27 PM, Howard Stearns wrote:
>
>> 2. Can keyboard and pointer events interrupt rendering? Especially
>> considering that the handlers for these events might unintentionally
>> suspend. (At the time, we didn't know, e.g., about Transcript doing
>> refreshWorld, which may not be an example of this, but shows the kind
>> of thing that can happen. See my 2006-08-14 message.)

-- 
Howard Stearns
University of Wisconsin - Madison
Division of Information Technology
mailto:[hidden email]
jabber:[hidden email]
voice:+1-608-262-3724
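Editor's sketch: the ordering concern in the post above (messages sorted by timestamp on insertion, and the question of where a locally added snapshot request would land) can be illustrated with a small priority queue in Python. This is a toy under assumptions, not the controller's real data structure: `TimedQueue`, `insert`, `drain`, and `replay` are hypothetical names, and the timestamps are made up. It shows that when every message carries a timestamp assigned by one authority, the processing order is the same regardless of the order in which messages arrive.

```python
import heapq
import itertools

class TimedQueue:
    """Keeps messages ordered by timestamp, whatever order they arrive in."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so entries with equal timestamps stay comparable
        # and keep their arrival order.
        self._counter = itertools.count()

    def insert(self, timestamp, label):
        heapq.heappush(self._heap, (timestamp, next(self._counter), label))

    def drain(self):
        order = []
        while self._heap:
            timestamp, _, label = heapq.heappop(self._heap)
            order.append((timestamp, label))
        return order

def replay(arrivals):
    """Insert (timestamp, label) pairs in arrival order, return processing order."""
    queue = TimedQueue()
    for timestamp, label in arrivals:
        queue.insert(timestamp, label)
    return queue.drain()

# Participant A already holds a future message when the snapshot request arrives.
order_a = replay([(30, "future tick"), (10, "move"), (20, "snapshot")])
# Participant B receives the same messages in timestamp order.
order_b = replay([(10, "move"), (20, "snapshot"), (30, "future tick")])
```

Both participants end up processing the snapshot after "move" and before "future tick". A request appended locally without an agreed timestamp has no such guarantee, which appears to be the difficulty described above.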
In reply to this post by Peter Moore-5
Hi Howard -
Howard Stearns wrote:
> Sorry, I should have been clearer, but I don't understand it well enough
> yet. (And so I get verbose. But hey, I'm trying to talk my way through
> to understanding it.) I'm speculating that the fundamental issue is that
> an island process is getting interleaved with non-island process.

Actually there is no need to speculate. The problem comes from the fact that syncSend:'s don't lock the island. You do this from multiple processes and you are hosed. In a later version I have already fixed that, but that version is a little too unstable right now for broader consumption.

Cheers,
  - Andreas
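Editor's sketch: the fix described above is not shown in the thread, so the following Python toy only illustrates the general idea of having synchronous sends and the snapshot take the same lock. Everything here is an assumption for illustration: `ToyIsland`, `sync_send`, and `snapshot` are hypothetical names, and the real syncSend: may work quite differently.

```python
import threading

class ToyIsland:
    """State guarded by one lock shared by senders and the snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self.x = 0
        self.y = 0

    def sync_send(self, delta):
        # Both fields change under the lock, so no other thread can
        # observe x updated while y is still stale.
        with self._lock:
            self.x += delta
            self.y += delta

    def snapshot(self):
        # Taking the same lock means the snapshot waits for any send
        # in progress and sees only complete updates.
        with self._lock:
            return (self.x, self.y)

island = ToyIsland()
snapshots = []

def mutate():
    for _ in range(1000):
        island.sync_send(1)

def checkpoint():
    for _ in range(100):
        snapshots.append(island.snapshot())

threads = [threading.Thread(target=mutate) for _ in range(4)]
threads.append(threading.Thread(target=checkpoint))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

In this model every recorded snapshot has `x == y`, because the checkpoint thread can never run in the middle of a send. Without the shared lock that invariant would not be guaranteed.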