Folks -

I've seen references to that particular error in the Unix VM in other discussions. We have this problem right now in a slightly different context (an external library call reports that error) and I am wondering if someone can explain to me what causes this error and whether there is a way to "fix" the Unix VM not to cause it.

Thanks for any insights!

Cheers,
  - Andreas
Hi Andreas,

On Feb 10, 2009, at 5:48 PM, Andreas Raab wrote:

> I've seen references to that particular error in the Unix VM in other
> discussions. We have this problem right now in a slightly different
> context (an external library call reports that error) and I am
> wondering if someone can explain to me what causes this error

The process is in a kernel system call when a lack of resources in blocking i/o (or a high-priority asynchronous event) causes the process to be suspended. A decent OS would save the process state reflecting its being halfway through the syscall such that at resumption it would continue the syscall, transparently to the user. Saving this state, in kernel mode and intermediate between two valid user-mode states, is very hard. (Think about what would be needed if an asynchronous signal arrived for a process suspended halfway through a syscall, for example.) Unix, being particularly pragmatic but not particularly decent, chooses instead to abort the syscall but to act as if it had completed (the user process resumes after the point of the call, not at it), with a failure code (EINTR) to indicate the call was aborted. The caller (the user's program) is expected to deal with this by restarting the syscall explicitly, with (presumably) identical arguments.

Google for "pc losering problem" (with the quotes) if you need more on this.

> and whether there is a way to "fix" the Unix VM not to cause it.

You might be able to drastically reduce the number of asynchronous signals (and hence the likelihood of an interrupted syscall) by counting milliseconds with gettimeofday() instead of with periodic timer interrupts. '-notimer' (SQUEAK_NOTIMER) is the option (environment variable), IIRC.
The pragmatically correct way to deal with this is to wrap each and every syscall in a Unix program in something like this:

  while (EINTR == (err= syscall(whatever, ...)));
  if (err) { deal with it }

The philosophically correct way to deal with it is to use an OS that isn't Unix.

> Thanks for any insights!

I'm not sure that the above was insightful, but I hope it was explanatory.

Cheers,
Ian
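Ian's wrapper, spelled out with the usual errno convention (real syscalls return -1 and set errno rather than returning EINTR directly), might look like the sketch below. The helper name safe_read is illustrative only and not part of the VM sources:

```c
#include <errno.h>
#include <unistd.h>

/* Retry an interrupted read() until it completes or fails for a real
   reason.  Illustrative sketch only; safe_read is not a VM function. */
static ssize_t safe_read(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);   /* returns -1 with errno == EINTR if interrupted */
    } while (n < 0 && errno == EINTR);
    return n;                     /* >= 0 on success, -1 on a genuine error */
}
```

The same do/while shape applies to write(), recv(), waitpid(), and any other call whose man page lists EINTR.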
Hi Ian -

There is obviously a lot I don't understand about interrupt handling on Unix, since your description (and the stuff that I found looking for the PC-losering problem) doesn't make much sense to me ;-)

If I understand you correctly, the program is in a system call, then an interrupt happens, and for some inexplicable reason that means the OS has to back out of the system call. Why is that? Wouldn't it be more sensible to just delay delivering the interrupt up to the point where the syscall returns? Yes, it doesn't guarantee real-time response, but then there is probably more than one process running at any given time anyway, so I wouldn't expect interrupts to be delivered in real time to user land anyway. And I *really* can't fathom the thought that any interrupt that happens for a process within a syscall somehow auto-magically leads the kernel to forget the state associated with the call ;-)

But regardless of the above, I guess the point here is that this is really a buggy library if it doesn't wrap each and every syscall in such a test, no? Is the Unix VM generally doing this? Are there mitigating factors where you can be pretty sure it won't happen, or particularly bad things (for example, syscalls that take several milliseconds with an itimer interrupt set to 1ms resolution or so)? Do you know if heavy network activity affects this behavior?

Thanks for all the info!

Cheers,
  - Andreas

Ian Piumarta wrote:
> Hi Andreas,
>
> On Feb 10, 2009, at 5:48 PM, Andreas Raab wrote:
>
>> I've seen references to that particular error in the Unix VM in other
>> discussions. We have this problem right now in a slightly different
>> context (an external library call reports that error) and I am
>> wondering if someone can explain to me what causes this error
>
> The process is in a kernel system call when a lack of resources in
> blocking i/o (or a high-priority asynchronous event) causes the process
> to be suspended.
> A decent OS would save the process state reflecting its being halfway
> through the syscall such that at resumption it would continue the
> syscall, transparently to the user. Saving this state, in kernel mode
> and intermediate between two valid user-mode states, is very hard.
> (Think about what would be needed if an asynchronous signal arrived
> for a process suspended halfway through a syscall, for example.)
> Unix, being particularly pragmatic but not particularly decent,
> chooses instead to abort the syscall but to act as if it had completed
> (the user process resumes after the point of the call, not at it),
> with a failure code (EINTR) to indicate the call was aborted. The
> caller (the user's program) is expected to deal with this by
> restarting the syscall explicitly, with (presumably) identical
> arguments.
>
> Google for "pc losering problem" (with the quotes) if you need more on
> this.
>
>> and whether there is a way to "fix" the Unix VM not to cause it.
>
> You might be able to drastically reduce the number of asynchronous
> signals (and hence the likelihood of an interrupted syscall) by
> counting milliseconds with gettimeofday() instead of with periodic
> timer interrupts. '-notimer' (SQUEAK_NOTIMER) is the option
> (environment variable), IIRC.
>
> The pragmatically correct way to deal with this is to wrap each and
> every syscall in a Unix program in something like this:
>
>   while (EINTR == (err= syscall(whatever, ...)));
>   if (err) { deal with it }
>
> The philosophically correct way to deal with it is to use an OS that
> isn't Unix.
>
>> Thanks for any insights!
>
> I'm not sure that the above was insightful, but I hope it was
> explanatory.
>
> Cheers,
> Ian
I've noted in the past that the socket code lacks a few checks for EINTR; I have a revised version here if anyone wants to look at it. It was a boring review of each syscall, determining whether the man page says EINTR is a valid error code, then writing the retry logic.

On 10-Feb-09, at 8:11 PM, Andreas Raab wrote:
> But regardless of the above, I guess the point here is that this is
> really a buggy library if it doesn't wrap each and every syscall
> into such a test, no? Is the Unix VM generally doing this? Are there
> mitigating factors where you can be pretty sure it won't happen or
> particularly bad things (for example having syscalls that take
> several milliseconds with an itimer interrupt set to 1ms resolution
> or so?). Do you know if heavy network activity affects this behavior?

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
On Wed, Feb 11, 2009 at 4:38 AM, Ian Piumarta <[hidden email]> wrote:
> The pragmatically correct way to deal with this is to wrap each and
> every syscall in a Unix program in something like this:
>
>   while (EINTR == (err= syscall(whatever, ...)));
>   if (err) { deal with it }
>
> The philosophically correct way to deal with it is to use an OS that
> isn't Unix.

Another pragmatically correct way to deal with it is to make use of restartable syscalls, so that when a signal arrives, the handler is called, and when it completes the syscall resumes. This is done when installing the signal handler, by using sigaction() instead of signal(), and by setting the SA_RESTART flag.

As luck would have it, I'm busy investigating changing Squeak to this model at the moment, having had problems with platforms which clear the handler after the first signal and require it to be reinstalled.

Andrew
Ian: I assume you mean this:
http://www.stanford.edu/~stinson/cs240/cs240_1/WIB.txt

Thanks -- it was an interesting read!

On Wed, Feb 11, 2009 at 6:11 AM, Andreas Raab <[hidden email]> wrote:
> There is obviously a lot I don't understand about interrupt handling
> on Unix since your description (and the stuff that I found looking for
> the PC-losering problem) doesn't make much sense to me ;-)
>
> If I understand you correctly, then the program is in a system call,
> then an interrupt happens and for some inexplicable reason that means
> the OS has to back out of the system call. Why is that? Wouldn't it be
> more sensible to just delay delivering the interrupt up to the point
> where the syscall returns? Yes, it doesn't guarantee real-time
> response but then there is probably more than one process running at
> any given time anyway so I wouldn't expect interrupts to be delivered
> in real time to user land anyway. And I *really* can't fathom the
> thought that any interrupt that happens for a process within a syscall
> somehow auto-magically leads the kernel to forget the state associated
> with the call ;-)

Andreas: I'll try to clarify this for you. The situation is this:

a. A user-level process makes a kernel call which might take a while, typically for I/O.

b. An interrupt arrives and is delivered to the user-level process. In the Unix world this is a software interrupt, called a "signal"; hardware interrupts are handled by the kernel and are not visible to user-level processes.

c. When delivering the signal to the user-level process, the kernel needs to make a choice: should it (1) call the signal handler and, when it returns, cancel the I/O operation and return an error (in this case, EINTR); or (2) call the signal handler and, when it returns, restart/resume the I/O operation?

d. When the signal handler returns, should the kernel (3) leave it up to the user-level process to re-instate it, or (4) re-instate the signal handler itself?
All Unices with signal handling support options (1) and (3). Modern Unices (since at least 12 years ago) also support options (2) and (4).

From what I can tell, Squeak assumes a bit of both models:

- it does not specify that system calls should be restarted.
- it does not re-instate the handler (i.e. it assumes that the kernel will).
- it does not in every case do the manual check for EINTR and restart, as John mentioned in his post.

My approach is to fix the former two, thus avoiding having to fix the latter. It should be simple, given that there are only a few calls to signal() in the codebase: sqUnixMain.c, aio.c, UnixOSProcessPlugin.c. It should be possible to replace each signal() call with sigaction() with SA_RESTART, fixing both the syscall-restarting problem and the handler-reinstating problem. This way we don't have to go through the entire codebase looking for I/O operations, checking for EINTR, restarting, etc. And we don't have to remember to add it to any new code we might write in the future.

This whole issue is complicated further in that signal() on certain Unices (including FreeBSD, Linux, and MacOS) will restart syscalls automatically, and certain Unices won't (SYS-V, including Solaris and -- I think -- HPUX). And certain Unices will re-instate the handler automatically, and others won't. So the existing code may be broken yet appear to work, if you happen to be on the "right" platform.

Another thing: signals may arrive from external processes (e.g. the kill command) or from Squeak itself. aio.c asks the kernel to notify it when there is I/O available for reading/writing/etc. When there's I/O ready, the kernel sends SIGIO; when SIGIO arrives, Squeak jumps to forceInterruptCheck.

- Andrew
On Feb 11, 2009, at 1:56 AM, Andrew Gaylard wrote:
> Ian: I assume you mean this:
> http://www.stanford.edu/~stinson/cs240/cs240_1/WIB.txt

http://inwap.com/pdp10/pclsr.txt
http://en.wikipedia.org/wiki/PCLSRing

> - it does not specify that system calls should be restarted.
> - it does not re-instate the handler (i.e. it assumes that the kernel
>   will).
> - it does not in every case do the manual check for EINTR and restart,
>   as John mentioned in his post.
>
> My approach is to fix the former two, thus avoiding having to fix the
> latter.

Setting SA_RESTART does not guarantee that the kernel can (or even should) restart the call transparently; e.g., select() and poll() return EINTR even if the intervening signal has the restart flag set. Section 10.5 of Advanced Programming in the UNIX Environment (Stevens and Rago) explains quantitatively some issues in handling EINTR, has a table of the behaviours supported among a few flavours of Unix, and mentions a few inconsistencies. The sections on context switching and interrupts in The Design of the UNIX Operating System (Bach) explain qualitatively the pragmatic choice in Unix to always save the post-syscall user-mode PC rather than trying to roll back partially-completed calls when necessary.

Cheers,
Ian
On Wed, Feb 11, 2009 at 9:43 PM, Ian Piumarta <[hidden email]> wrote:
> On Feb 11, 2009, at 1:56 AM, Andrew Gaylard wrote:
>
>> - it does not specify that system calls should be restarted.
>> - it does not re-instate the handler (i.e. it assumes that the kernel
>>   will).
>> - it does not in every case do the manual check for EINTR and
>>   restart, as John mentioned in his post.
>>
>> My approach is to fix the former two, thus avoiding having to fix the
>> latter.
>
> Setting SA_RESTART does not guarantee that the kernel can (or even
> should) restart the call transparently; e.g., select() and poll()
> return EINTR even if the intervening signal has the restart flag set.
> Section 10.5 of Advanced Programming in the UNIX Environment (Stevens
> and Rago) explains quantitatively some issues in handling EINTR, has a
> table of the behaviours supported among a few flavours of Unix, and
> mentions a few inconsistencies.

You're absolutely right. Section 5.9 of Unix Network Programming (Stevens, Fenner, & Rudoff) is even more explicit:

  *Some* kernels automatically restart *some* interrupted system
  calls. For portability, when we write a program that catches
  signals (most concurrent servers catch SIGCHLD), we must be
  prepared for slow system calls to return EINTR.

(Their emphasis, not mine.)

So we still have to check each time for EINTR and loop if necessary. Sorry about that.

- Andrew
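Since select() is one of the calls SA_RESTART does not cover, the explicit loop the thread converges on can be sketched as below; safe_select is an illustrative name, not a function from the VM sources:

```c
#include <errno.h>
#include <sys/select.h>

/* select() can return -1 with errno == EINTR even when the handler was
   installed with SA_RESTART, so the caller must loop.  Sketch only;
   note that on some systems the timeout is modified across calls. */
static int safe_select(int nfds, fd_set *rd, fd_set *wr, fd_set *ex,
                       struct timeval *tv)
{
    int n;
    do {
        n = select(nfds, rd, wr, ex, tv);
    } while (n < 0 && errno == EINTR);
    return n;   /* number of ready descriptors, 0 on timeout, -1 on error */
}
```

One subtlety worth noting: Linux updates *tv to the remaining time, so a naive retry there shortens the wait, while other systems leave it untouched and a retry restarts the full timeout.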