Squeak periodic crash

Squeak periodic crash

squeakdev
I am seeing Squeak crash (segmentation fault) every four and a half days.
(actually 4.586041667 days,  110.065 hours, or 396234 seconds, give or
take a few seconds).

Doesn't matter if the image is idling or busy (mostly SMTP and Web
serving).  This has been happening for many months, fully repeatable,
on 2 different machines. I've basically lived with it (and its
somewhat dampening effect on my confidence in Squeak for mission
critical apps) because I haven't the slightest idea why it's
happening, but after my last build, it has gone from 4.58 days to only
1.74 days (albeit based on only one sample so far).

The image is 3.7 vintage, 24M given -memory 30M, recent vm (exact
versions or any other details of interest on request), running on
Debian on a XEN or UML virtual node. (I'm currently trying it on a
real box to see if virtualization could somehow be implicated).

I can't begin to fathom what event could possibly be happening in the
VM that would be tied to this particular time interval, or anything
special about the value itself.  (I imagine that GC is the code that's
running when it crashes). Any thoughts?

Re: Squeak periodic crash

Andreas.Raab
[hidden email] wrote:
> I can't begin to fathom what event could possibly be happening in the
> VM that would be tied to this particular time interval, or anything
> special about the value itself.  (I imagine that GC is the code that's
> running when it crashes). Any thoughts?

The main thought is that you shouldn't start pointing fingers unless you
have at least _some_ evidence supporting your claims. It's easy to claim
that it's caused by GC, or the network subsystem, or the timer code, or
the OS signal handling or any number of random reasons if you have no
idea what is going on. Things to do:
- Tell us more about what you are actually running. Is this a stock
3.7 VM and image? If not what packages have you loaded? Do you have
specific dependencies on external (non-squeak) packages? Dependencies on
external C libraries that could cause memory corruption?
- Run the VM under gdb and let it crash. Try to investigate from there,
in particular try to print the call stacks (don't remember what the
magic invocation is). The VM implements both printCallStack(), to print
the active call stack, and printAllStacks(), which prints all the call
stacks (but I'm not sure which of those is supported in 3.7).

Cheers,
   - Andreas

Re: Squeak periodic crash

squeakdev
In reply to this post by squeakdev
> - Run the VM under gdb and let it crash. Try to investigate from there,
> in particular try to print the call stacks (don't remember what the
> magic invocation is). The VM implements both, printCallStack() to print
> the active call stack and printAllStacks() which prints all the call
> stacks (but I'm not sure which of those is supported in 3.7).

It crashed in gdb but after only ~7 hours (so it's not clear this is
the same issue). But here is the stack:

Program received signal SIGPIPE, Broken pipe.
0xb7e8f2ce in __write_nocancel () from /lib/tls/libc.so.6

gdb>backtrace

#0  0xb7e8f2ce in __write_nocancel () from /lib/tls/libc.so.6
#1  0x080e8936 in sqSocketSendDataBufCount (s=0xb5aaa624,
    buf=0xb5b3c3f4 "Content-Type:
image/gif;\r\n\tname=\"orprxh.gif\"\r\nContent-ID:
<orprxh.gif@2D967531.ECA1AA47>\r\nContent-Transfer-Encoding:
base64\r\n\r\nR0lGODlh+gHYAbMAAAAADoQAAACCAAAAfNNhDXyOxvj56/ALCgcN///2/w",
'A' <repeats 15 times>..., bufSize=2077) at
/home/ajr/Squeak-3.9-7/platforms/unix/plugins/SocketPlugin/sqUnixSocket.c:1067
#2  0x080e661c in primitiveSocketSendDataBufCount ()
    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/intplugins/SocketPlugin/SocketPlugin.c:1046
#3  0x0805bb77 in dispatchFunctionPointer (aFunctionPointer=0x80e64a0)
    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:3949
#4  0x08064305 in primitiveExternalCall () at
/home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:14208
#5  0x0805bb77 in dispatchFunctionPointer (aFunctionPointer=0x8064230)
    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:3949
#6  0x0806d796 in interpret () at
/home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:7756
#7  0x0805a5c9 in main (argc=1953394499, argv=0x0, envp=0x65707954)
    at /home/ajr/Squeak-3.9-7/platforms/unix/vm/sqUnixMain.c:1388

Re: Squeak periodic crash

johnmci
Ok, assuming the C code is this 3.7 VM code below I'll note that the  
backtrace shows some interesting memory addresses
> (s=0xb5aaa624,
>    buf=0xb5b3c3f4 "Content-Type:
> image/gif;\r\n\tname=\"orprxh.gif\"\r\nContent-ID:
> <orprxh.gif@2D967531.ECA1AA47>\r\nContent-Transfer-Encoding:
> base64\r\n\r\nR0lGODlh+gHYAbMAAAAADoQAAACCAAAAfNNhDXyOxvj56/
> ALCgcN///2/w",
> 'A' <repeats 15 times>..., bufSize=2077)

which implies that buf is 0xb5b3c3f4. However, with a 3.7 VM, once
object memory goes over 0x80000000 you are hosed because of signed
versus unsigned arithmetic issues within the VM. In fact I'll bet this
is still an issue even with a 3.8 VM, because I can't say I've seen any
proof of a systematic effort to ensure memory addresses don't
accidentally become signed integers somewhere in the VM or
platform-specific files. This might be the reason for your failures: if
the VM loaded object memory above 0x80000000 I would have thought it
would crash immediately, but if it loads below and then grows over the
limit, the crash would occur later.

On the other hand I'm not sure why you would get the SIGPIPE failure  
and why that would cause the VM to crash in libc.
That sounds like an operating system problem you should google on.


int sqSocketSendDataBufCount(SocketPtr s, int buf, int bufSize)
{
   int nsent= 0;

   if (!socketValid(s))
     return -1;

   if (UDPSocketType == s->socketType)
     {
       /* --- UDP --- */
       FPRINTF((stderr, "UDP sendData(%d, %d)\n", SOCKET(s), bufSize));
       if ((nsent= sendto(SOCKET(s), (void *)buf, bufSize, 0,
                         (struct sockaddr *)&SOCKETPEER(s),
                         sizeof(SOCKETPEER(s)))) <= 0)
        {
          if (errno == EWOULDBLOCK) /* asynchronous write in progress */
            return 0;
          FPRINTF((stderr, "UDP send failed\n"));
          SOCKETERROR(s)= errno;
          return 0;
        }
     }
   else
     {
       /* --- TCP --- */
       FPRINTF((stderr, "TCP sendData(%d, %d)\n", SOCKET(s), bufSize));
       if ((nsent= write(SOCKET(s), (char *)buf, bufSize)) <= 0)
        {
          if ((nsent == -1) && (errno == EWOULDBLOCK))
            {
              FPRINTF((stderr, "TCP sendData(%d, %d) -> %d [blocked]",
                       SOCKET(s), bufSize, nsent));
              return 0;
            }
          else
            {
              /* error: most likely "connection closed by peer" */
              SOCKETSTATE(s)= OtherEndClosed;
              SOCKETERROR(s)= errno;
              FPRINTF((stderr, "TCP write failed -> %d", errno));
              return 0;
            }
        }
     }
   /* write completed synchronously */
   FPRINTF((stderr, "sendData(%d) done = %d\n", SOCKET(s), nsent));
   return nsent;
}


On 8-Nov-06, at 5:04 AM, [hidden email] wrote:

>> - Run the VM under gdb and let it crash. Try to investigate from  
>> there,
>> in particular try to print the call stacks (don't remember what the
>> magic invocation is). The VM implements both, printCallStack() to  
>> print
>> the active call stack and printAllStacks() which prints all the call
>> stacks (but I'm not sure which of those is supported in 3.7).
>
> It crashed in gdb but after only ~7 hours (so it's not clear this is
> the same issue). But here is the stack:
>
> Program received signal SIGPIPE, Broken pipe.
> 0xb7e8f2ce in __write_nocancel () from /lib/tls/libc.so.6
>
> gdb>backtrace
>
> #0  0xb7e8f2ce in __write_nocancel () from /lib/tls/libc.so.6
> #1  0x080e8936 in sqSocketSendDataBufCount (s=0xb5aaa624,
>    buf=0xb5b3c3f4 "Content-Type:
> image/gif;\r\n\tname=\"orprxh.gif\"\r\nContent-ID:
> <orprxh.gif@2D967531.ECA1AA47>\r\nContent-Transfer-Encoding:
> base64\r\n\r\nR0lGODlh+gHYAbMAAAAADoQAAACCAAAAfNNhDXyOxvj56/
> ALCgcN///2/w",
> 'A' <repeats 15 times>..., bufSize=2077) at
> /home/ajr/Squeak-3.9-7/platforms/unix/plugins/SocketPlugin/
> sqUnixSocket.c:1067
> #2  0x080e661c in primitiveSocketSendDataBufCount ()
>    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/intplugins/
> SocketPlugin/SocketPlugin.c:1046
> #3  0x0805bb77 in dispatchFunctionPointer (aFunctionPointer=0x80e64a0)
>    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:3949
> #4  0x08064305 in primitiveExternalCall () at
> /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:14208
> #5  0x0805bb77 in dispatchFunctionPointer (aFunctionPointer=0x8064230)
>    at /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:3949
> #6  0x0806d796 in interpret () at
> /home/ajr/Squeak-3.9-7/platforms/unix/src/vm/interp.c:7756
> #7  0x0805a5c9 in main (argc=1953394499, argv=0x0, envp=0x65707954)
>    at /home/ajr/Squeak-3.9-7/platforms/unix/vm/sqUnixMain.c:1388
>

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================