Here is some more information about the bug I reported November 24.

First, the symptom is much worse than it seems. After you use Ctrl-Break to get out of it, your image is not restored to normal. You can no longer use any delays at all.

Second, the cause of the bug is cumulative. The code shown below is just one way to make it happen. It can also happen as a result of using one image for doing a lot of miscellaneous development work involving processes.

This code is not intended as any kind of benchmark or other such test. It's only to show this bug. It was the result of spending a lot of time finding a way to make the bug repeatable.

When I posted it November 24, I may have given the impression that this bug was just an interesting quirk. But it's a serious bug which makes Dolphin unreliable for doing development work involving large numbers of processes. I hope Object Arts is giving it high priority and will soon be able to explain what caused it.

Here is the code again to make it happen. Be sure to discard the image after running this in a workspace:

    | procs |
    procs := OrderedCollection new.
    50 timesRepeat: [
        procs add: [
            50 timesRepeat: [Processor sleep: 1].
        ] fork.
    ].
    procs do: [:proc | proc terminate].
    Processor sleep: 100.
    Processor sleep: 100.

I can't tell you exactly what's going wrong, but here are a few things to help keep you working, and a step towards an explanation of what's happening.

First, when you execute your script and it locks up Dolphin: in case you didn't already know, you can break out of that by pressing <CONTROL><BREAK> and then choosing "terminate" on the resulting walkback. That leaves the system in a state where any other Delay>>wait or Processor>>sleep: will lock up again. This is because the AccessProtect semaphore used internally by the Delay class has been left in an improper state.

To get back to a normal state you can try the following (which works reliably for me). You need to define a new method on the class side of Delay:

================
prod2
    "Private - Forcibly reset the mutex to its 'normal' state"
    AccessProtect set.
================

You can now execute (Delay prod2) a few times to wake up any processes which are wrongly still sleeping on it (there may be more than one, so I execute it several times; I don't know if that's necessary). That will also leave the Semaphore in the correct 'signalled' state. It'd probably be an idea to execute (Delay prod) too.

It should now be possible to execute (Processor sleep: 1) without any ill effects. If so, then your image is probably healthy again.

Now, as to what's going wrong. I don't have the complete story, but the lines:

================
procs := OrderedCollection new.
50 timesRepeat: [
    procs add: [
        50 timesRepeat: [Processor sleep: 1].
    ] fork.
].
procs do: [:proc | proc terminate].
procs := nil.
[:proc | ] value: nil.
================

reliably leave the Delay class's AccessProtect semaphore with a few extra signals (typically about half a dozen on this machine). Once that has happened it will let more than one process execute the critical section at the same time, and sooner or later this will result in the Delay stuff breaking.

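To make the effect of a stray signal concrete, here is a minimal workspace sketch. It is not taken from Delay's own code: a private Semaphore stands in for AccessProtect, the variable names (mutex, inside, maxInside, done) are made up for illustration, and it assumes Semaphore class>>forMutualExclusion and Semaphore>>critical: behave in Dolphin as they do in other Smalltalks. It deliberately avoids Delay and Processor>>sleep: so that it does not depend on the machinery under discussion.

```smalltalk
"Sketch only: a private Semaphore stands in for Delay's AccessProtect.
 All variable names here are illustrative."
| mutex inside maxInside done |
mutex := Semaphore forMutualExclusion.  "healthy mutex: exactly one signal"
mutex signal.                           "simulate one stray extra signal"
inside := 0.
maxInside := 0.
done := Semaphore new.
3 timesRepeat: [
    [mutex critical: [
        inside := inside + 1.
        maxInside := maxInside max: inside.
        Processor yield.                "let the other processes try to get in"
        inside := inside - 1].
    done signal] fork].
3 timesRepeat: [done wait].             "wait for all three processes to finish"
Transcript show: 'Most processes inside at once: ', maxInside printString; cr.
```

With the extra `mutex signal` line removed I would expect this to report 1; with it in place, more than one process should be able to get inside the "critical" section together, which is the situation Delay ends up in.
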
In fact, executing the following (all at one go) will leave the Semaphore with one extra signal (which doesn't immediately cause visible problems, but it will allow them to occur later since the Delay class is no longer threadsafe):

================
p1 := [Processor sleep: 1000000] fork.
p2 := [Processor sleep: 1000000] fork.
p1 terminate.
p2 terminate.
================

This leaves the AccessProtect Semaphore with 2 signals against it instead of 1. (Blair, if you haven't already solved the problem by the time you read this, then you'll be relieved to know that it reproduces reliably on both my slowish W98 machine and my fastish Win2K machine.)

It is necessary to have at least 2 processes for this to "work". I put a bit of tracing in, and it appears that the first process enters the critical section in Delay>>wait OK, and that everything's still OK after it has been interrupted by the #terminate. The second process doesn't get into the critical section at all, but unless it is started, and #terminated, the problem doesn't occur.

Given that, I *suspect* that the problem is being caused when the second process attempts to enter the critical section, but is interrupted before it's really got into it. The Semaphore gets signalled, so I think that the underlying call to Semaphore>>wait:ret: must have started and then "returned" WAIT_OBJECT_0. But the VM apparently hadn't managed to decrement its signal-count by the time the process was interrupted. The nett effect is then to increment its signal-count wrongly. If I'm right, then it looks as if it requires a VM fix.

BTW, three suggestions for OA:

1) Should Process>>sleep: use an #ensure: block to clear its Delay object when it is terminated? As it is, the Delay instance continues to hang around after termination, which doesn't cause malfunctions, as such, but is untidy and hangs on to unnecessary references.

2) It'd be nice if the Process Monitor showed counts of the active/dead/etc processes in its caption.

3) Can you arrange for the Process Monitor to preserve the selection and scroll position when it refreshes itself please? (Otherwise it's impossible to terminate processes a long way down the list.)

    -- chris

Chris,

> 3) Can you arrange for the Process Monitor to preserve the selection and
> scroll position when it refreshes itself please ? (Otherwise its impossible
> to terminate processes a long way down the list.)

As a workaround, can you simply slow down the refresh rate to give yourself time to get there? IIRC, there's an option to set the rate.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.

Bill,

> As a workaround, can you simply slow down the refresh rate to give yourself
> time to get there? IIRC, there's an option to set the rate.

True, or even to pause it altogether.

Cheers.

> Bill

    -- chris