32bit clean VM work.

johnmci
 
Some of you might know that David T. Lewis has been working on  
changes to the VM source to make it work fully within 32 or 64 bit  
address spaces.

As we know, the Squeak VM treated memory addresses, which are unsigned
values, as signed integer values. This wrong use of signed math in
compare statements or do loops could cause the VM to make an incorrect
decision, resulting in corrupted memory and a VM crash.
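
To make the failure mode concrete, here is a minimal C sketch. It is not taken from the VM source: the function names and the example addresses are made up for illustration. It only shows that the same 32-bit pattern compares differently once it is read as a signed value.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bounds check done the old way: addresses held in a signed
   32-bit integer. Anything at or above 0x80000000 reads as negative. */
static bool below_limit_signed(uint32_t addr, uint32_t limit)
{
    /* Converting a value above INT32_MAX to int32_t is implementation-
       defined; on the usual two's-complement targets it wraps negative. */
    return (int32_t)addr < (int32_t)limit;
}

/* The same check with the addresses kept unsigned. */
static bool below_limit_unsigned(uint32_t addr, uint32_t limit)
{
    return addr < limit;
}
```

With an object at 0x7FFFFFF0 and an end of memory at 0x80000010, the unsigned check says the object is inside the heap, while the signed check says it is not, because the limit is read as a negative number.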

This issue would usually occur if you wanted to use 1GB of memory for
your VM and the host operating system then allocated that memory for
you above the 2GB boundary, or at, say, the 1.5GB boundary. The result
was either an instant crash, or a crash much later when your memory
needs caused the VM to grow over the 2GB boundary.

Some fixes were done in the past to make the VM mostly run when fully  
over the 2GB boundary but at best they were insufficient patches.

Over the last couple of days I reviewed David Lewis' changes, made
some fixes, revised the Macintosh OS X support files, and worked up
some general test cases to see what happens when you run the macro
benchmarks below the 2GB boundary, crossing the 2GB boundary, and
when the image is allocated at the 3GB boundary.

This afternoon I'm pleased to say the VM passed all runs of my
trivial test cases, so I have checked in the Mac OS Carbon source code
changes and David's changes to the Mac OS source tree for further
review.

People wanting to build a VM should review the Mac OS build  
instructions to build a Mac OS carbon VM, or review the required  
changes to VMMaker as per the Carbon VM build readme to build a 32bit  
clean VM.

I have not:

(a) built a 64-bit VM and tested it.

VM developers should consider the mmap call in the memory allocation
routine, where you can specify a suggested starting position. On OS X
I was able to choose 1GB, 1.5GB, 2GB and 3GB. I have not tested 64-bit
VMs at the 0x8000000000000000 boundary. I suspect you could allocate
at 0x7FFFFFFFF0000000, then ask for 600MB of memory for the image.
That would set the end of memory at 0x8000000015800000, 344MB over
the negative sign boundary.

(b) I have not tested or reviewed any of the external plugins for  
improper use of usqInt.

(c) I have not confirmed the changes work with the Unix VM or the
Windows VM, and I have no plans to do so.
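
The 64-bit boundary arithmetic in (a) can be checked with a few lines of C. This is only a sketch of the calculation; the helper names are made up for illustration and nothing here is VM code.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* End of a heap of `size` bytes starting at `base`, as an unsigned address.
   Assumes base + size does not wrap around 2^64. */
static uint64_t heap_end(uint64_t base, uint64_t size)
{
    return base + size;
}

/* Whole megabytes of the heap lying at or above the signed 64-bit boundary
   0x8000000000000000 (zero if the heap ends below it).
   Assumes the heap starts below the boundary. */
static uint64_t mb_over_sign_boundary(uint64_t base, uint64_t size)
{
    const uint64_t boundary = 0x8000000000000000ULL;
    uint64_t end = heap_end(base, size);
    return end > boundary ? (end - boundary) / MB : 0;
}
```

Starting at 0x7FFFFFFFF0000000 with 600 * MB, heap_end gives 0x8000000015800000 and mb_over_sign_boundary gives 344, matching the figures above.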

--
========================================================================
===
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
========================================================================
===



Re: 32bit clean VM work.

Damien Cassou-3
 
2007/6/10, John M McIntosh <[hidden email]>:

> (c) I have not confirmed the changes work with the Unix VM, or the
> Windows VM, I have no plans to do so.

What can we do to test the unix vm?


--
Damien Cassou

Re: 32bit clean VM work.

Philippe Marschall
In reply to this post by johnmci
 
You make me want to buy a Mac.

Cheers
Philippe


Re: 32bit clean VM work.

johnmci
In reply to this post by Damien Cassou-3
 
Well, follow the instructions to build a Unix VM.

At the point where you use VMMaker to make the VM, ensure you go to
Mac OS/vm/specialChangeSets and load

ArraysToGlobalStruct-JMM.1.cs May already be in image, check source.
bigCursor-bf.1.cs
JMM-fixBiasToGrow.1.cs.zip
VMM38-64bit-imageUpdates.1.cs May already be in image, check source.
VMM38-gc-instrument-image.1.cs May already be in image, check source.
VmUpdates-dtl
        VmUpdates-1001-dtl.1.cs
        VmUpdates-1002-dtl.1.cs
        VmUpdates-1003-dtl.1.cs
        VmUpdates-1004-dtl.1.cs
        VmUpdates-1005-dtl.1.cs
        VmUpdates-1006-dtl.1.cs
        JMM-VmUpdates32bitclean.2.cs

For 64bit work no idea, wasn't there some issue with fixes needed to  
build it anyway?

Once the VM is built, to test it I suggest you look at the call to
mmap in the Unix memory allocation source, sqUnixMemory.c, and change
the start location from zero to, say, 1.5GB. Then start up your VM and
ask for 600MB of memory. In uxGrowMemoryBy, look at the value of
heap + heapSize to see where the heap ends and confirm your choices
are correct.
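
As a rough illustration of passing a start hint to mmap, here is a standalone C sketch. It is not the code in sqUnixMemory.c; request_heap is a hypothetical helper, and the log line only imitates the kind of check described above (where did the heap start, and where does it end).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD / OS X spelling */
#endif

/* Ask the kernel for `size` bytes of anonymous memory, suggesting `hint` as
   the start address. Without MAP_FIXED the hint is only advisory, so the
   caller must check where the mapping really landed.
   Returns NULL on failure. */
static void *request_heap(uintptr_t hint, size_t size)
{
    void *p = mmap((void *)hint, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    fprintf(stderr, "heap requested at %lx, allocated at %lx, ends at %lx\n",
            (unsigned long)hint, (unsigned long)(uintptr_t)p,
            (unsigned long)((uintptr_t)p + size));
    return p;
}
```

Calling request_heap(0x60000000, 600 * 1024 * 1024) asks for 600MB starting at 1.5GB. Whether the kernel honors the hint depends on the system, so the printed addresses are what matters.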


I then downloaded a 3.5 image since it contains the macrobenchmarks,  
and ran

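| suck |
suck := OrderedCollection new.
"Pre-allocate about 480MB so the active heap sits under the 2GB boundary."
suck add: (ByteArray new: 1024*1024*480).
97 timesRepeat: [
        Smalltalk macroBenchmarks.
        "Grow by 1MB per pass to walk the heap across the boundary."
        suck add: (ByteArray new: 1024*1024*1).
        Transcript show: Smalltalk garbageCollectMost; cr.
        Transcript show: Smalltalk garbageCollect; cr].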


By adjusting the 1024*1024*480 value you put the entire active VM
memory heap under the 2GB boundary; the timesRepeat: loop then
allocates memory and runs the benchmarks to cross over the boundary.
One possible modification is a smaller value for 1024*1024*1. Really
this should be 4 bytes, in order to march over the boundary under more
possible conditions, but running that would require on the order of
4 million iterations. Maybe someone could devote a week and run with
a 4K allocation.

As mentioned earlier:

I have not tested 64-bit VMs at the 0x8000000000000000 boundary. I
suspect you could allocate at 0x7FFFFFFFF0000000, then ask for
600MB of memory for the image. That would set the end of memory at
0x8000000015800000, 344MB over the negative sign boundary. Adjust
the initial memory allocation to find the boundary.


The other two test cases are the image below the 2GB boundary, which
should be covered by the first couple of benchmark runs for the VM,
and the VM fully over the 2GB boundary, which can be set up by
adjusting mmap to, say, 3GB.
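
The three placements being tested (below, crossing, and fully above the boundary) can be told apart from the heap start and size. The following C sketch is only an illustration; classify_heap and the enum names are hypothetical and not part of the VM.

```c
#include <stdint.h>

typedef enum { HEAP_BELOW, HEAP_CROSSING, HEAP_ABOVE } heap_placement;

/* Classify a heap [base, base + size) against an address boundary such as
   0x80000000. Assumes size > 0 and that base + size does not wrap. */
static heap_placement classify_heap(uint64_t base, uint64_t size,
                                    uint64_t boundary)
{
    uint64_t end = base + size;          /* one past the last heap byte */
    if (end <= boundary)
        return HEAP_BELOW;               /* whole heap under the boundary */
    if (base >= boundary)
        return HEAP_ABOVE;               /* whole heap over the boundary */
    return HEAP_CROSSING;                /* heap straddles the boundary */
}
```

With a 1.5GB start, a 480MB heap is classified as below the 2GB boundary and a 600MB heap as crossing it; a heap starting at 3GB is fully above.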


On Jun 10, 2007, at 3:21 AM, Damien Cassou wrote:

> 2007/6/10, John M McIntosh <[hidden email]>:
>
>> (c) I have not confirmed the changes work with the Unix VM, or the
>> Windows VM, I have no plans to do so.
>
> What can we do to test the unix vm?
>
>
> --
> Damien Cassou


Re: 32bit clean VM work.

David T. Lewis
In reply to this post by Damien Cassou-3
 
On Sun, Jun 10, 2007 at 12:21:05PM +0200, Damien Cassou wrote:
>
> 2007/6/10, John M McIntosh <[hidden email]>:
>
> >(c) I have not confirmed the changes work with the Unix VM, or the
> >Windows VM, I have no plans to do so.
>
> What can we do to test the unix vm?
>

Well, since you asked ;)

Following up on John's OS X work, I have now built Unix VMs on both
32-bit and 64-bit systems without problems.

The good news is that everything works fine on both platforms, even
when I set the base of heap memory to just below 2MB as John suggested
for testing.

The bad news is that I cannot get it to fail. My 32-bit system is
an older 2.4 Linux kernel, which refuses to mmap things at the
requested locations and therefore does not have a problem. On
the 64-bit (2.6 kernel) system, I can allocate heap below 2MB, and
Squeak is perfectly happy. There probably has never been any issue
on 64-bit Linux systems as far as I can tell (but you do need to
also load the fix from Mantis 5688 if you are building for a
64-bit system, and some others if you want to run an actual 64-bit
image).

So here is what is needed:

We need someone with a Linux system that *does* have the memory
problem, such that (for example) your Seaside application will
crash if you do not run it with the "-memory" option. On that same
system, build a new VM with the latest Subversion sources, with
VMMaker from SqueakMap, plus the fileins that John has provided
in the "platforms/Mac OS/vm/specialChangeSets/VmUpdates-dtl"
directory. No other fileins should be necessary on a 32-bit system,
so if you can build this VM, then run the Seaside application
without using a "-memory" option on the Squeak command line, then
we are probably in good shape.

It is quite likely that this *will* work, but someone with a
newer 32-bit Linux system will need to confirm it.

Note that some follow-up testing will probably be needed to try
forcing Squeak memory allocation at certain specific locations
(i.e. right below the 2GB address boundary), but just building
and running a new VM to see if it makes a known problem go away
would be a big help.

Thanks,

Dave
 

Re: 32bit clean VM work.

johnmci
 
> The good news is that everything works fine on both platforms, even
> when I set the base of heap memory to just below 2MB as John suggested
> for testing.

Well I assume you mean 2GB for 32bit systems, but for 64bit you need  
to get up to the 0x8000000000000000 boundary.

>
> The bad news is that I cannot get it to fail. My 32-bit system is
> an older 2.4 Linux kernel, which refuses to mmap things at the
> requested locations and therefore does not have a problem.


In all the crash cases we see the stack context go over the 2GB
boundary, expressed as negative values in the VM stack traces. Since
we know the for() loop in incCompMove trashes memory when you walk an
object move over the 2GB boundary, what you really need is to confirm
the image works fine when it starts under 2GB and ends over 2GB.

That and that really big number for 64bit systems.
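
For readers who have not seen this kind of loop bug, here is a small C sketch of a word-by-word walk with a signed versus an unsigned loop variable. It is a made-up stand-in, not the real incCompMove code, and it assumes 32-bit addresses and a range that stops short of the top of each integer type.

```c
#include <stdint.h>

/* Count the 4-byte words visited from `first` to `last` inclusive when the
   loop variable is a signed 32-bit integer. Hypothetical code.
   Assumes last is not within 4 bytes of 0x7FFFFFFF, else p += 4 overflows. */
static unsigned words_visited_signed(uint32_t first, uint32_t last)
{
    unsigned n = 0;
    int32_t end = (int32_t)last;   /* wraps negative above 0x7FFFFFFF */
    for (int32_t p = (int32_t)first; p <= end; p += 4)
        n++;
    return n;
}

/* The same walk with unsigned addresses.
   Assumes last is below 0xFFFFFFFC so the loop terminates. */
static unsigned words_visited_unsigned(uint32_t first, uint32_t last)
{
    unsigned n = 0;
    for (uint32_t p = first; p <= last; p += 4)
        n++;
    return n;
}
```

Walking from 0x7FFFFFF0 to 0x80000010, the unsigned loop visits 9 words while the signed loop visits none, because the end address reads as negative. In this sketch the symptom is a loop that silently does nothing; the actual consequence in the VM is whatever the surrounding code does with the unmoved data.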




Re: 32bit clean VM work.

Michael Rueger-6
In reply to this post by David T. Lewis
 
David T. Lewis wrote:

> The bad news is that I cannot get it to fail. My 32-bit system is

I have a debian system on VMware that so far has reliably failed without
the -memory.
If you send me your VM I can give it a try.

Michael


Re: 32bit clean VM work.

David T. Lewis
In reply to this post by johnmci
 
On Tue, Jun 12, 2007 at 12:12:16AM -0700, John M McIntosh wrote:
> >The good news is that everything works fine on both platforms, even
> >when I set the base of heap memory to just below 2MB as John suggested
> >for testing.
>
> Well I assume you mean 2GB for 32bit systems, but for 64bit you need  
> to get up to the 0x8000000000000000 boundary.

On the 64-bit system, it's not allowing me to use anything that high
in the address space. I can request a mmap to 0xfff00000000 and the
request will be honored:

uxAllocateMemory: heap requested at fff00000000, allocated at fff00000000

But if I request a higher location, it decides that I am being unreasonable
and uses its own assignment:

uxAllocateMemory: heap requested at ffff00000000, allocated at 2aca47532000

Thus I cannot say if there would be any issues at the 0x8000000000000000
boundary, but I can say that this does not appear to be a possible
failure mode on current Linux implementations.
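
One plausible reading of the two log lines above is the size of the user address space. The thread does not say which architecture this is, so the following C sketch rests on an assumption: an x86-64 Linux kernel with a 47-bit user address space. The macro and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed top of user space: 47 bits, as on x86-64 Linux with 4-level page
   tables. Other architectures and kernels use different limits. */
#define ASSUMED_USER_SPACE_TOP 0x0000800000000000ULL

/* True if a heap of `size` bytes at `hint` would fit below the assumed
   limit, i.e. if the kernel could honor the hint at all. */
static bool hint_is_reachable(uint64_t hint, uint64_t size)
{
    return hint < ASSUMED_USER_SPACE_TOP &&
           size <= ASSUMED_USER_SPACE_TOP - hint;
}
```

Under that assumption 0xfff00000000 is reachable, 0xffff00000000 is not, and neither is anything near 0x8000000000000000, which would fit the behavior reported here.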

> >
> >The bad news is that I cannot get it to fail. My 32-bit system is
> >an older 2.4 Linux kernel, which refuses to mmap things at the
> >requested locations and therefore does not have a problem.
>
>
> In all the crash cases we see the stack context go over the 2GB
> boundary, expressed as negative values in the VM stack traces. Since
> we know the for() loop in incCompMove trashes memory when you walk an
> object move over the 2GB boundary, what you really need is to confirm
> the image works fine when it starts under 2GB and ends over 2GB.

Right.

But I was too hasty. I saw that my mmap request did not work at
0x7FFFFFFF and assumed that my older Linux system did not honor
this, but I just tried again with:
#define SQBASE (0x7FFFFFFF - 2000000)

And this *did* work. Better yet, I can now reproduce the problem
reliably:

uxAllocateMemory: heap requested at 7fe17b7f, allocated at 7fe18000

sweep failed to find exact end of memory

-2126938800 SystemDictionary>garbageCollect

I will re-apply our fixes and see what happens.
I'm late for work though, so it may not be done this morning.
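
The numbers in the output above can be cross-checked with a little C. This is a sketch with hypothetical helper names, and the 4096-byte page size is an assumption (the log is consistent with it, but the thread does not state it).

```c
#include <stdint.h>

/* Round an address up to the next multiple of page_size (a power of two). */
static uint32_t page_align_up(uint32_t addr, uint32_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}

/* Recover the 32-bit address behind a value printed as a signed integer. */
static uint32_t address_from_signed(int32_t printed)
{
    return (uint32_t)printed;   /* conversion is modulo 2^32, well defined */
}
```

0x7FFFFFFF - 2000000 is 0x7FE17B7F, and page_align_up(0x7FE17B7F, 4096) gives 0x7FE18000, matching the requested and allocated addresses. address_from_signed(-2126938800) gives 0x81397D50, roughly 19.6MB above the 2GB boundary, which is the kind of negative value in a stack trace that John describes.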

Dave


Re: 32bit clean VM work.

David T. Lewis
In reply to this post by Michael Rueger-6
 
On Tue, Jun 12, 2007 at 09:39:48AM +0200, Michael Rueger wrote:
>
> David T. Lewis wrote:
>
> >The bad news is that I cannot get it to fail. My 32-bit system is
>
> I have a debian system on VMware that so far has reliably failed without
> the -memory.
> If you send me your VM I can give it a try.

Michael,

Thanks for the offer. I think I've got a repeatable failure case now
so I should be able to complete the testing on my system.

Dave


Re: 32bit clean VM work.

David T. Lewis
In reply to this post by David T. Lewis
 
OK, I can now confirm that the changes work on 32-bit Linux also.
After applying the changes, then forcing the heap to this:
#define SQBASE (0x7FFFFFFF - 2000000)

I can run Squeak without the crash, allocating 300MB of strings in
the image, freeing them and doing a GC, all without problems:

lewis@dtlewis:/data3/lewis/squeak/sq/Squeak3.9> squeak withfixes
uxAllocateMemory: heap requested at 7fe17b7f, allocated at 7fe18000

So everything looks good on both 32-bit and 64-bit Linux. I have
not tried any 64-bit images, but would not expect any problem
there either.

Dave


Re: 32bit clean VM work.

johnmci
In reply to this post by David T. Lewis
 

On Jun 12, 2007, at 4:15 AM, David T. Lewis wrote:

> On the 64-bit system, it's not allowing me to use anything that high
> in the address space. I can request a mmap to 0xfff00000000 and the
> request will be honored:


I was just reviewing these emails for Michael and I don't think I
commented on this: allocating at 0xfff00000000 would be fine to test
the Squeak oops space over the 0x8000000000000000 boundary.

It wasn't clear whether you were able to allocate below the
0x7FFFFFFFFFFFFFFF boundary to test allocating over that boundary.
Also, I've never heard whether anyone has been able to run a Squeak
image at or near, say, 4GB.


Re: 32bit clean VM work.

Ian Piumarta
 

>> On the 64-bit system, it's not allowing me to use anything that high
>> in the address space. I can request a mmap to 0xfff00000000 and the
>> request will be honored:
>
> I was just reviewing these emails for michael and I don't think I  
> commented on this
> allocating at 0xfff00000000 would be fine to test the squeak oops  
> space over the 0x8000000000000000 boundary.
>
> It wasn't clear if you were able to allocate below the  
> 0x7FFFFFFFFFFFFFFF  boundary to test allocating over
> that boundary.   Also I've never heard if any one has been able to  
> run a Squeak image at say at or near 4GB

Dave: if you look in sqUnixMemory.c you'll see a facility in the
memory allocator to artificially skew all oops by a certain amount.
If you calculate the amount after the allocation you can place the
apparent start of memory at any address you like. There is (or was)
some corresponding code in sqMemory.h to unskew all oops accordingly
and bring them back into the allocated memory, but I've no idea
whether anyone has removed it since (or even whether it made it out
of the 64-bit prototype sources and into the final repository
version). This would seem by far the easiest way to force oops (as
seen by the Interpreter) to occupy interesting borderline address
ranges.
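
The skewing idea can be sketched in a few lines of C. This is not the code in sqUnixMemory.c or sqMemory.h; the function names are hypothetical and the sketch assumes a flat address space where pointers convert cleanly to uintptr_t.

```c
#include <stdint.h>

/* Hypothetical skew, chosen after allocation so that the first heap byte
   appears to the interpreter at `apparent_start`. Unsigned arithmetic wraps,
   so this works whether the real address is above or below the target. */
static uintptr_t compute_skew(const void *real_start, uintptr_t apparent_start)
{
    return apparent_start - (uintptr_t)real_start;
}

/* Real pointer -> oop as seen by the interpreter. */
static uintptr_t oop_for_pointer(const void *ptr, uintptr_t skew)
{
    return (uintptr_t)ptr + skew;
}

/* Oop -> real pointer, undoing the skew before memory is touched. */
static void *pointer_for_oop(uintptr_t oop, uintptr_t skew)
{
    return (void *)(oop - skew);
}
```

Choosing an apparent start of 0x7FFFFFF0 makes a heap only 32 bytes long appear to straddle the 2GB boundary, wherever the kernel really placed it. The same trick with a 64-bit target would put oops around 0x8000000000000000 without needing mmap to honor such an address.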

Cheers,
Ian



Re: 32bit clean VM work.

David T. Lewis
 
John,

No, I did not test the 0x7FFFFFFFFFFFFFFF boundary at all. Is it important
to do so? If so, I'll see if I can set up a test based on Ian's tip.

Ian, are you referring to the SQ_FAKE_MEMORY_OFFSET macro? It looks like
that would do what you are suggesting.

Avi, have you been running the VM with these changes in any production
situations? If so, any feedback you might be able to provide would be
appreciated. Thanks!

Dave


Re: 32bit clean VM work.

Andreas.Raab
 
David T. Lewis wrote:
> Avi, have you been running the VM with these changes in any production
> situations? If so, any feedback you might be able to provide would be
> appreciated. Thanks!

We've been running VMs with these changes[*] and they have fixed the
Linux problems that we had. As an aside, one thing that we ran into (and
that I just fixed a couple of days ago) is the effect that Delay in
Squeak is not safe. Much of the manipulation of the Delay internal
structures (SuspendedDelay and friends) happens from the calling process
and if that calling process gets killed things go south quickly.
Unfortunately, you won't ever run into this unless you run a server
(because you won't ever get "truly asynchronous" interrupts to cause
this to happen) which makes it all but impossible to recreate this
problem on a single machine.

[*] Yours plus John's IGC fix which turned out to be important.

Cheers,
   - Andreas

Re: 32bit clean VM work.

johnmci
In reply to this post by David T. Lewis
 

On Jul 14, 2007, at 2:40 PM, David T. Lewis wrote:

> John,
>
> No, I did not test the 0x7FFFFFFFFFFFFFFF boundary at all. Is it  
> important
> to do so? If so, I'll see if I can set up a test based on Ian's tip.


Well, that boundary is the magic positive versus negative signed
64-bit integer value.
It's really just a cross check to confirm everything works as
expected in the 64-bit version.

It would appear that you've tested below that value, and above the
value with your 0xFF testing;
it's just the crossing of that value that remains, to dot our i's and
cross our t's, so to speak.


>
> Ian, are you referring to the SQ_FAKE_MEMORY_OFFSET macro? It looks  
> like
> that would do what you are suggesting.
>
> Avi, have you been running the VM with these changes in any production
> situations? If so, any feedback you might be able to provide would be
> appreciated. Thanks!
>
> Dave

Re: 32bit clean VM work.

johnmci
In reply to this post by Andreas.Raab
 

On Jul 14, 2007, at 2:48 PM, Andreas Raab wrote:

>
> We've been running VMs with these changes[*] and they have fixed  
> the Linux problems that we had. As an aside, one thing that we ran  
> into (and that I just fixed a couple of days ago

So you have these changes where?  I was not clear on your comment  
about server versus desktop and how the issue is triggered.  Do dual  
processor intel desktop machines count as Server machines?
Or is it the application mix? Single user application, versus MC or  
Seaside server.



Re: 32bit clean VM work.

Andreas.Raab
 
John M McIntosh wrote:
> On Jul 14, 2007, at 2:48 PM, Andreas Raab wrote:
>> We've been running VMs with these changes[*] and they have fixed the
>> Linux problems that we had. As an aside, one thing that we ran into
>> (and that I just fixed a couple of days ago
>
> So you have these changes where?  I was not clear on your comment about
> server versus desktop and how the issue is triggered.  Do dual processor
> intel desktop machines count as Server machines?

Can't say for sure. Only that our server's MTBF was somewhere between
24-48 hours because of that problem. After deploying the fix we've been
going for three days straight with no problems (fingers crossed). If we
can make it to a week or so I'll post the changes since deploying them
on such short notice was a somewhat desperate measure due to heavy
customer complaints.

If you want to look at some code, the problematic places are pretty
obvious: Delay>>schedule, Delay>>unschedule, and Delay>>activate are all
prone to being terminated while updating Delay-internal structures. When
that happens, the result is a total system lockup since Delay resources
are globally shared. Also, note that these operations run with the
client's priority which makes it very possible to be preempted by a
higher priority process and cause other problems. For example, consider
a low priority process holding the Delay lock and a medium priority
process sitting in a tight loop for some reason; this will lock up the
entire system since the timer interrupt watcher won't be able to enter
the semaphore. I have a couple of stack traces showing these and
related problems.

The one saving grace for us was to have USR1 generate a full stack dump
of all processes for forensic reasons. Without that we'd be using Java
on the servers by now (no kidding; this is still an option and depends
largely on whether we can make Squeak reliable enough as a server).

Cheers,
   - Andreas