Tight compute loops -> no context switch


Tight compute loops -> no context switch

Bill Dargel
The recent thread on Ctrl-Break inspired me to try out a few things. I
thought I had something, using a high-priority process that could be
used to interrupt the main user interface process.

But, what worked in D5 didn't in D6.

I did some more experimenting and came to the rather disturbing
conclusion that tight computations in D6 can also keep higher priority
Smalltalk processes from running.

This can be demonstrated with:
---
go := true.
log := OrderedCollection new.
[
        [Processor sleep: 1000.
        log add: Time now -> 'interrupt awake'.
        go] whileTrue.
] forkAt: Processor userInterruptPriority.

log add: Time now -> 'start loop'.
Time millisecondsToRun: [20000000 timesRepeat: [2.0+17/(3*2+42)]].
log add: Time now -> 'loop done'.
go := false.
log inspect.
---

Adjust the constant (starting small) to have the computation take a
reasonable amount of time, say around 10 seconds.

On D6 the higher priority process only runs after the tight computation
is done. On D5 it wakes up once each second (as it should). You can
change the computation in various ways that _will_ allow the process
context switching to occur. For example simply using "2 + 2.0" will
work, with its primitive failure and coercing of the SmallInteger into a
Float.

It's certainly not clear to me just what VM operations need to be
occurring during a computation so that process context switching will
happen. Can anyone shed some light? It would be good to gauge how likely it
is that this situation might arise in an actual application. I'm also
curious if this is related to the issue of getting ctrl-break to respond.

A related observation: the times for the computation using D5 were
#(7334 7286 7211) while in D6 they were #(9439 9276 9301), or 28%
slower! Granted, this is just one data point and there may be good
reasons for the change, but still, I don't like the direction.

--
Bill Dargel            [hidden email]
Shoshana Technologies
100 West Joy Road, Ann Arbor, MI 48105  USA



Re: Tight compute loops -> no context switch

Chris Uppal-3
Bill,


> I did some more experimenting and came to the rather disturbing
> conclusion that tight computations in D6 can also keep higher priority
> Smalltalk processes from running.


I see the same effect, including D6 being slower on this example than D5, on
WinXP.  Interesting...


> I'm also
> curious if this is related to the issue of getting ctrl-break to respond.

I think they must be connected -- the un-preemptable loop is not interruptible
by ctrl-break, but changing the calculation to make it pre-emptable also
makes it interruptible.

Presumably the VM checks for ctrl+break at the same time as its re-scheduling
code runs.

    -- chris



Re: Tight compute loops -> no context switch

Blair-2
In reply to this post by Bill Dargel
"Bill Dargel" <[hidden email]> wrote in message
news:[hidden email]...
> The recent thread on Ctrl-Break inspired me to try out a few things. I
> thought I had something, using a high-priority process that could be used
> to interrupt the main user interface process.
>
> But, what worked in D5 didn't in D6.
>
> I did some more experimenting and came to the rather disturbing conclusion
> that tight computations in D6 can also keep higher priority Smalltalk
> processes from running.

Hi Bill, thanks for the report. Sorry for the delayed response. This issue
is due mainly to a compiler change that has made the bytecodes it generates
for some loops get out of sync with the VM's assumptions. In D5 the compiler
used to generate rather inefficient code for some loops (such as
#timesRepeat:) which contained a conditional jump at the start of the loop,
and an unconditional jump at the end. Both jumps were executed for every
loop iteration. In D6 the compiler uses the more efficient implementation of
jumping unconditionally to the end of the loop when the loop starts, and
having only a single conditional jump back on loop iterations. Unfortunately
the VM still assumes the old behaviour, and only tests for asynchronous
events (such as timer interrupts and Ctrl+Break) on certain unconditional
backwards jumps. This means that tight loops which do not contain any actual
message sends will be uninterruptable. It also means that tight loops which
contain relatively few message sends may take a very long time to respond to
Ctrl+Break, since this is expensive to check for and so is sampled only
after executing a relatively large number of message sends/jumps. The actual
sampling interval will depend on the speed of your machine, but will
probably be around 2^20 on modern machines. The calculation of the sampling
interval was based on a trade-off between overhead and responsiveness to
asynchronous events that was designed back in the early days of Dolphin. It
may no longer be appropriate.
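
The counter-based sampling Blair describes can be modelled with a short sketch
(Python here purely as a toy model; the real VM is C, and all names and the
interval value are illustrative, not Dolphin's actual code). The interpreter
only looks at the pending-interrupt flag when a send/jump counter expires, so a
loop compiled with no real sends and no counted jumps never reaches a poll
point:

```python
# Toy model of counter-based async-event sampling (hypothetical names;
# not Dolphin's actual VM code).

SAMPLE_INTERVAL = 1024  # illustrative; the post suggests ~2^20 on fast machines

class ToyVM:
    def __init__(self):
        self.poll_counter = SAMPLE_INTERVAL
        self.pending_interrupt = False
        self.interrupts_seen = 0

    def count_down_and_poll(self):
        # Called only on message sends and unconditional backward jumps.
        self.poll_counter -= 1
        if self.poll_counter == 0:
            self.poll_counter = SAMPLE_INTERVAL
            if self.pending_interrupt:
                self.pending_interrupt = False
                self.interrupts_seen += 1

    def run_tight_loop(self, iterations, polls_per_iteration):
        # polls_per_iteration == 0 models D6's optimized #timesRepeat: loop:
        # no send and no counted jump, so interrupts are never noticed.
        for _ in range(iterations):
            for _ in range(polls_per_iteration):
                self.count_down_and_poll()

vm = ToyVM()
vm.pending_interrupt = True
vm.run_tight_loop(10_000, polls_per_iteration=0)
print(vm.interrupts_seen)   # 0: the optimized loop never reaches a poll point

vm.run_tight_loop(10_000, polls_per_iteration=1)
print(vm.interrupts_seen)   # 1: one real send per iteration reaches the poll
```

This also shows why "2 + 2.0" unblocks things: the primitive failure forces
real sends, each of which ticks the counter.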

> ...
> Adjust the constant (starting small) to have the computation take a
> reasonable amount of time, say around 10 seconds.
>
> On D6 the higher priority process only runs after the tight computation is
> done. On D5 it wakes up once each second (as it should). You can change
> the computation in various ways that _will_ allow the process context
> switching to occur. For example simply using "2 + 2.0" will work, with its
> primitive failure and coercing of the SmallInteger into a Float.

This is because a number of true message sends are then needed to perform
the computation. You should also find that Ctrl+Break will now work almost
immediately too.

>
> It's certainly not clear to me just what VM operations need to be
> occurring during a computation so that process context switching will
> happen. Can anyone shed some light?

Just ensure some actual (unoptimised) messages are sent. You can identify
optimised message sends as they are labelled "special" in the disassembly.

>...It would be good to gauge how likely it is that this situation might arise
>in an actual application.

Unlikely I would say, unless your application contains tight loops that do
nothing but arithmetic.

>...I'm also curious if this is related to the issue of getting ctrl-break
>to respond.
>

Yes it is.


> A related observation: the times for the computation using D5 were #(7334
> 7286 7211) while in D6 they were #(9439 9276 9301), or 28% slower!
> Granted, this is just one data point and there may be good reasons for the
> change, but still, I don't like the direction.

The speed of this loop is dominated entirely by the floating point
arithmetic, which in turn is dominated by allocation time. Allocation of
floats in a tight loop like this is slower in D6. Indeed you may find there
are some micro benchmarks on which D5 is faster. I'm not really concerned
about most of these, however, since these are mostly insignificant in
relation to overall application performance. On more macro benchmarks D6
should be faster, sometimes substantially so. Where it's not, I'm interested
to hear about it, but remember that overall performance is a balance and
sometimes trade offs are necessary.

Regards

Blair



Re: Tight compute loops -> no context switch

Bill Dargel
Blair wrote:
> Hi Bill, thanks for the report.

Thanks for the explanation!

> ... The calculation of
> the sampling interval was based on a trade off between overhead and
> responsiveness to asynchronous events that was designed back in the
> early days of Dolphin. It may no longer be appropriate.

After reading this description of counting message sends and
unconditional backwards jumps as a way to estimate elapsed time, my
deja-vu o'meter started to ring. Here's a link to the end of a
discussion we had three years ago entitled "External calls and UI
responsiveness":
http://groups.google.com/group/comp.lang.smalltalk.dolphin/msg/1f6c18c0dc194fac

Back then, the cause of low message/jump counts was external calls,
instead of tight loops (with the erroneous VM assumption). And what was
being starved was checking of the windows message queue, instead of
checking for ctrl-break or the Delay timer event.

But the underlying problem sure seems the same -- using that
message/jump counter as an approximator of elapsed time.

IMNAWP (I am not a Windows programmer), but wouldn't it be possible to
have a windows thread wait on a periodic timer and set or increment a
memory location that the main VM process could then test, just like it
tests the message/jump count? Seems like the overhead for the main VM
process would be exactly the same as it is now. Perhaps even less, if it
no longer needed to maintain the message count.

--
Bill Dargel            [hidden email]
Shoshana Technologies
100 West Joy Road, Ann Arbor, MI 48105  USA



Re: Tight compute loops -> no context switch

Eliot Miranda
Bill Dargel wrote:

> Blair wrote:
>
>> Hi Bill, thanks for the report.
>
>
> Thanks for the explanation!
>
>> ... The calculation of the sampling interval was based on a trade off
>> between overhead and responsiveness to asynchronous events that was
>> designed back in the early days of Dolphin. It may no longer be
>> appropriate.
>
>
> After reading this description of counting message sends and
> unconditional backwards jumps as a way to estimate elapsed time, my
> deja-vu o'meter started to ring. Here's a link to the end of a
> discussion we had three years ago entitled "External calls and UI
> responsiveness":
> http://groups.google.com/group/comp.lang.smalltalk.dolphin/msg/1f6c18c0dc194fac 
>
>
> Back then, the cause of low message/jump counts was external calls,
> instead of tight loops (with the erroneous VM assumption). And what was
> being starved was checking of the windows message queue, instead of
> checking for ctrl-break or the Delay timer event.
>
> But the underlying problem sure seems the same -- using that
> message/jump counter as an approximator of elapsed time.
>
> IMNAWP (I am not a Windows programmer), but wouldn't it be possible to
> have a windows thread wait on a periodic timer and set or increment a
> memory location that the main VM process could then test, just like it
> tests the message/jump count? Seems like the overhead for the main VM
> process would be exactly the same as it is now. Perhaps even less, if it
> no longer needed to maintain the message count.

Yes, it is possible.  This is what I did in VW.  Up through VW 3.0 we
had a counter decremented at the start of any method that built a frame
(i.e. no count for non-failing primitives or simple inst var accessors).
The system would poll every 1024 counts (I think). First, this
produces a ridiculously high event poll frequency on fast machines
running normal Smalltalk code.  Second it wouldn't poll nearly often
enough if certain long-running primitives (i.e. large integer
arithmetic) are in use.  I remember in ~ 1997 having a test case that
took 30 seconds from typing control-C for the system to interrupt the
running process.

This approach was required for Win3.1 support which didn't have true
multi-threading.  So in ~1998/1999 I threw it out and added a high-priority
thread that looped waiting on a Windows semaphore with a 30 ms timeout
that caused the VM to poll at a steady 30 milliseconds unless it was
asleep.  This actually improved base compute performance on WinNT et al
by nearly 50% (i.e. nearly doubled performance), mainly by removing an
extremely expensive read-modify-write cycle on method activation.
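
The timer-thread scheme Bill proposes and Eliot describes can be sketched as
follows (Python threads stand in for the Windows thread and semaphore timeout;
this is a model of the idea, not either VM's actual code). A background thread
wakes on a fixed period and sets a flag; the compute loop's only per-iteration
cost is a cheap flag test:

```python
# Sketch of timer-driven interrupt polling (illustrative; the VW VM did
# this in C with a high-priority Windows thread waiting on a semaphore
# with a ~30 ms timeout).
import threading

POLL_INTERVAL_S = 0.03          # ~30 ms, as in Eliot's description

poll_requested = threading.Event()
stop = threading.Event()

def timer_thread():
    # Wakes every POLL_INTERVAL_S and asks the interpreter to poll.
    # stop.wait returns False on timeout, True once stop is set.
    while not stop.wait(POLL_INTERVAL_S):
        poll_requested.set()

t = threading.Thread(target=timer_thread, daemon=True)
t.start()

polls = 0
work = 0
while polls < 3:                # run until we have polled a few times
    work += 1                   # the "tight computation"
    if poll_requested.is_set():  # cheap flag test each iteration
        poll_requested.clear()
        polls += 1              # here the real VM would check Ctrl+Break,
                                # Delay timers, and higher-priority processes

stop.set()
t.join()
print(polls)  # 3
```

The poll frequency is now fixed by wall-clock time, independent of machine
speed and of what the compute loop is doing, which is the property both posts
are after.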
--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd