Interrupted system call?

Andreas.Raab
 
Folks -

I've seen references to that particular error in the Unix VM in other
discussions. We have this problem right now in a slightly different
context (an external library call reports that error) and I am wondering
if someone can explain to me what causes this error and whether there is
a way to "fix" the Unix VM not to cause it.

Thanks for any insights!

Cheers,
   - Andreas

Re: Interrupted system call?

Ian Piumarta
 
Hi Andreas,

On Feb 10, 2009, at 5:48 PM, Andreas Raab wrote:

> I've seen references to that particular error in the Unix VM in  
> other discussions. We have this problem right now in a slightly  
> different context (an external library call reports that error) and  
> I am wondering if someone can explain to me what causes this error

The process is in a kernel system call when a lack of resource in
blocking i/o (or a high-priority asynchronous event) causes the
process to be suspended.  A decent OS would save the process state
reflecting its being halfway through the syscall such that at
resumption it would continue the syscall, transparently to the user.
Saving this state, in kernel mode and intermediate between two valid
user-mode states, is very hard.  (Think about what would be needed if
an asynchronous signal arrived for a process suspended halfway through
a syscall, for example.)  Unix, being particularly pragmatic but not
particularly decent, chooses instead to abort the syscall but to act
as if it had completed (the user process resumes after the point of
the call, not at it) but with a failure code (EINTR) to indicate the
call was aborted.  The caller (the user's program) is expected to
deal with this by restarting the syscall explicitly, with
(presumably) identical arguments.

Google for "pc losering problem" (with the quotes) if you need more  
on this.

> and whether there is a way to "fix" the Unix VM not to cause it.

You might be able to drastically reduce the number of asynchronous
signals (and hence the likelihood of an interrupted syscall) by
counting milliseconds with gettimeofday() instead of with periodic
timer interrupts.  '-notimer' (SQUEAK_NOTIMER) is the option
(environment variable), IIRC.
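
A minimal sketch of the polling approach, assuming the goal is simply a millisecond count that needs no timer signal. The names clock_init and clock_msecs are hypothetical and this is not the VM's actual code.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helpers for illustration; not taken from the Squeak VM. */
static struct timeval start_time;

/* Record the reference point once at startup. */
void clock_init(void)
{
    gettimeofday(&start_time, NULL);
}

/* Milliseconds elapsed since clock_init(), computed on demand.
   No periodic signal is involved, so nothing interrupts a syscall. */
long clock_msecs(void)
{
    struct timeval now;
    long long usecs;

    gettimeofday(&now, NULL);
    usecs = (long long)(now.tv_sec - start_time.tv_sec) * 1000000LL
          + (now.tv_usec - start_time.tv_usec);
    return (long)(usecs / 1000);
}
```

gettimeofday() reports wall-clock time, so the count can jump if the system clock is adjusted; clock_gettime(CLOCK_MONOTONIC, ...) is an alternative where that matters.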

The pragmatically correct way to deal with this is to wrap each and
every syscall in a Unix program in something like this:

while (-1 == (err= syscall(whatever, ...)) && EINTR == errno);
if (-1 == err) { deal with it }

The philosophically correct way to deal with it is to use an OS that  
isn't Unix.
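
As a concrete sketch of the retry idiom, here it is applied to read(). It assumes the usual convention that a failing call returns -1 and stores the reason in errno; the name read_retry is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: read() that retries when a signal interrupts it.
   A failing syscall returns -1 and sets errno, so the EINTR test is on
   errno rather than on the return value. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);

    return n;   /* >= 0 on success, -1 with errno set on a real error */
}
```

The same loop shape applies to any call whose man page lists EINTR among its errors.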

> Thanks for any insights!

I'm not sure that the above was insightful, but I hope it was  
explanatory.

Cheers,
Ian


Re: Interrupted system call?

Andreas.Raab
 
Hi Ian -

There is obviously a lot I don't understand about interrupt handling on
Unix since your description (and the stuff that I found looking for
PC-losering problem) don't make much sense to me ;-)

If I understand you correctly, then the program is in a system call, then
an interrupt happens and for some inexplicable reason that means the OS
has to back out of the system call. Why is that? Wouldn't it be more
sensible to just delay delivering the interrupt up to the point where
the syscall returns? Yes, it doesn't guarantee real-time response but
then there is probably more than one process running at any given time
anyway so I wouldn't expect interrupts to be delivered real-time to user
land anyway. And I *really* can't fathom the thought that any interrupt
that happens for a process within a syscall somehow auto-magically leads
to the kernel forgetting the state associated with the call ;-)

But regardless of the above, I guess the point here is that this is
really a buggy library if it doesn't wrap each and every syscall into
such a test, no? Is the Unix VM generally doing this? Are there
mitigating factors where you can be pretty sure it won't happen, or
particularly bad cases (for example, syscalls that take several
milliseconds with an itimer interrupt set to 1ms resolution or so)? Do
you know if heavy network activity affects this behavior?

Thanks for all the info!

Cheers,
   - Andreas


Re: Interrupted system call?

johnmci
 
I've noted in the past that the socket code lacks a few checks for EINTR;
I have a revised version here if anyone wants to look at it. The work is a
boring review of each syscall: determine whether the man page lists EINTR
as a valid error code, then write the retry logic.


On 10-Feb-09, at 8:11 PM, Andreas Raab wrote:

> But regardless of the above, I guess the point here is that this is  
> really a buggy library if it doesn't wrap each and every syscall  
> into such a test, no? Is the Unix VM generally doing this? Are there  
> mitigating factors where you can be pretty sure it won't happen or  
> particularly bad things (for example having syscalls that take  
> several milliseconds with an itimer interrupt set to 1ms resolution  
> or so?). Do you know if heavy network activity affects this behavior?

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Interrupted system call?

Andrew Gaylard
In reply to this post by Ian Piumarta
 
On Wed, Feb 11, 2009 at 4:38 AM, Ian Piumarta <[hidden email]> wrote:

> The pragmatically correct way to deal with this is to wrap each and every
> syscall in a Unix program in something like this:
>
> while (-1 == (err= syscall(whatever, ...)) && EINTR == errno);
> if (-1 == err) { deal with it }
>
> The philosophically correct way to deal with it is to use an OS that isn't
> Unix.

Another pragmatically correct way to deal with it is to make use of
restartable syscalls, so that when signals arrive, the handler is called,
and when it completes the syscall resumes.  This is done when installing
the signal handler,  by using sigaction instead of signal, and by setting
the SA_RESTART flag.

As luck would have it, I'm busy investigating changing Squeak to this
model at the moment, having had problems with platforms which
clear the handler after the first signal and require it to be reinstalled.

Andrew

Re: Interrupted system call?

Andrew Gaylard
In reply to this post by Andreas.Raab
 
Ian: I assume you mean this:

http://www.stanford.edu/~stinson/cs240/cs240_1/WIB.txt

Thanks -- it was an interesting read!

On Wed, Feb 11, 2009 at 6:11 AM, Andreas Raab <[hidden email]> wrote:

> There is obviously a lot I don't understand about interrupt handling on Unix
> since your description (and the stuff that I found looking for PC-losering
> problem) don't make much sense to me ;-)
>
> If I understand you correctly, then the program is in a system call, then an
> interrupt happens and for some inexplicable reason that means the OS has to
> back out of the system call. Why is that? Wouldn't it be more sensible to
> just delay delivering the interrupt up to the point where the syscall
> returns? Yes, it doesn't guarantee real-time response but then there is
> probably more than one process running at any given time anyway so I
> wouldn't expect interrupts to be delivered real-time to user land anyway.
> And I *really* can't fathom the thought that any interrupt that happens for
> a process within a syscall somehow auto-magically leads to the kernel
> forgetting the state associated with the call ;-)

Andreas: I'll try to clarify this for you.  The situation is this:

a. a user-level process makes a kernel call which might take a while,
typically for I/O.

b. an interrupt arrives and is delivered to the user-level process.  In
the Unix world, this is a software interrupt, and is called a "signal";
hardware interrupts are handled by the kernel and are not visible
to user-level processes.

c. when delivering the signal to the user-level process, the kernel
needs to make a choice: should it
(1) call the signal handler and, when it returns, cancel the I/O
operation and return an error (in this case, EINTR); or
(2) call the signal handler and, when it returns, restart/resume
the I/O operation?

d. when the signal handler returns, should the kernel
(3) leave it up to the user-level process to re-instate it; or
(4) re-instate the signal handler itself?

All Unices with signal handling support options (1) and (3).
Modern Unices (since at least 12 years ago) also support options
(2) and (4).

From what I can tell, Squeak assumes a bit of both models.
- it does not specify that system calls should be restarted.
- it does not re-instate the handler (i.e. it assumes that kernel will).
- it does not in every case do the manual check for EINTR and restart,
as John mentioned in his post.

My approach is to fix the former two, thus avoiding having to fix the
latter.  It should be simple, given that there are only a few calls to
signal() in the codebase: sqUnixMain.c, aio.c, UnixOSProcessPlugin.c.
It should be possible to replace each signal() call with sigaction() with
SA_RESTART, fixing both the syscall restarting problem and the
handler-reinstating problem.  This way we don't have to go through
the entire codebase looking for IO operations and checking for EINTR,
restarting, etc.  And we don't have to remember to add it into any
new code we might write in the future.

This whole issue is complicated further in that signal() on certain Unices
(including FreeBSD, Linux, and Mac OS X) will restart syscalls automatically,
and certain Unices won't (System V, including Solaris and, I think, HP-UX).
And certain Unices will re-instate the handler automatically, and others
won't.  So the existing code may be broken yet appear to work, if you
happen to be on the "right" platform.

Another thing: signals may arrive from external processes (e.g. the kill
command) or from Squeak itself.  aio.c asks the kernel to notify it when
there is I/O available for reading/writing/etc. When there's I/O ready,
the kernel sends SIGIO; when SIGIO arrives, Squeak jumps to
forceInterruptCheck.

- Andrew

Re: Interrupted system call?

Ian Piumarta
 
On Feb 11, 2009, at 1:56 AM, Andrew Gaylard wrote:

> Ian: I assume you mean this:
> http://www.stanford.edu/~stinson/cs240/cs240_1/WIB.txt

http://inwap.com/pdp10/pclsr.txt
http://en.wikipedia.org/wiki/PCLSRing

> - it does not specify that system calls should be restarted.
> - it does not re-instate the handler (i.e. it assumes that kernel  
> will).
> - it does not in every case do the manual check for EINTR and restart,
> as John mentioned in his post.
>
> My approach is to fix the former two, thus avoiding having to fix the
> latter.

Setting SA_RESTART does not guarantee that the kernel can (or even  
should) restart the call transparently; e.g., select() and poll()  
return EINTR even if the intervening signal has the restart flag  
set.  Section 10.5 of Advanced Programming in the UNIX Environment  
(Stevens and Rago) explains quantitatively some issues in handling  
EINTR, has a table of the behaviours supported among a few flavours  
of Unix, and mentions a few inconsistencies.  The sections on context  
switching and interrupts in The Design of the Unix Operating System  
(Bach) explain qualitatively the pragmatic choice in Unix to always  
save the post-syscall user mode PC rather than trying to roll back  
partially-completed calls when necessary.

Cheers,
Ian


Re: Interrupted system call?

Andrew Gaylard
 
On Wed, Feb 11, 2009 at 9:43 PM, Ian Piumarta <[hidden email]> wrote:

>
> On Feb 11, 2009, at 1:56 AM, Andrew Gaylard wrote:
>
>> - it does not specify that system calls should be restarted.
>> - it does not re-instate the handler (i.e. it assumes that kernel will).
>> - it does not in every case do the manual check for EINTR and restart,
>> as John mentioned in his post.
>>
>> My approach is to fix the former two, thus avoiding having to fix the
>> latter.
>
> Setting SA_RESTART does not guarantee that the kernel can (or even should)
> restart the call transparently; e.g., select() and poll() return EINTR even
> if the intervening signal has the restart flag set.  Section 10.5 of
> Advanced Programming in the UNIX Environment (Stevens and Rago) explains
> quantitatively some issues in handling EINTR, has a table of the behaviours
> supported among a few flavours of Unix, and mentions a few inconsistencies.

You're absolutely right.  Section 5.9 of Unix Network Programming (Stevens,
Fenner, & Rudoff) is even more explicit:

*Some* kernels automatically restart *some* interrupted system calls.
For portability, when we write a program that catches signals (most
concurrent servers catch SIGCHLD), we must be prepared for slow
system calls to return EINTR.

(Their emphasis, not mine). So we still have to check each time for EINTR
and loop if necessary.

Sorry about that.

- Andrew