Curious context change latency experiment results (wrt DelayScheduler)


Curious context change latency experiment results (wrt DelayScheduler)

Ben Coman
One of the key parts of Delay scheduling is setting the resumption time.
There are two places this could be done.
a. In Delay>>#schedule, running (for example) at userSchedulingPriority=40
b. In DelayMicrosecondScheduler>>#handleTimerEvent:, running at timingPriority=80

When we were using the millisecond clock for delay scheduling,
it made sense to use (b.) since the clock might(?) roll over 
between when the resumption time was set and when #handleTimerEvent: expires delays.

This should not be a problem now that we are using the microsecond clock, 
so I wondered what the latency penalty might be between (a.) and (b.).
I ran a little experiment to measure the cost/latency of that process switch,
and I'm curious whether anyone can comment on the validity of the experiment
and the interpretation of its results.

I made the following three changes...

DelayScheduler subclass: #DelayMicrosecondScheduler
    instanceVariableNames: 'latencyStart countSinceLastZero'
    classVariableNames: 'LatencyCounts'
    poolDictionaries: ''
    category: 'Kernel-Processes'


DelayMicrosecondScheduler>>#schedule: aDelay
    latencyStart := Time primUTCMicrosecondsClock.  "This is position (a.)"
    aDelay schedulerBeingWaitedOn ifTrue: [ ^self error: 'This Delay has already been scheduled.' ].
    accessProtect critical: [
        scheduledDelay := aDelay.
        timingSemaphore signal.  "#handleTimerEvent: sets scheduledDelay := nil"
    ].


DelayMicrosecondScheduler>>#handleTimerEvent: microsecondNowTick
    | microsecondNextTick |
    "Process any schedule requests"  "This is position (b.)"
    scheduledDelay ifNotNil: [
        | latency |
        latency := Time primUTCMicrosecondsClock - latencyStart.
        LatencyCounts ifNil: [ LatencyCounts := Bag new ].
        LatencyCounts add: latency.
        latency = 0
            ifTrue: [ countSinceLastZero := 1 + (countSinceLastZero ifNil: [0]) ]
            ifFalse: [
                Transcript
                    crShow: 'zero latency count ', countSinceLastZero printString;
                    show: ' before latency ', latency printString.
                countSinceLastZero := 0 ].
        "Schedule the given delay."
        scheduledDelay scheduler: self resumptionTime: microsecondNowTick + (1000 * scheduledDelay millisecondDelayDuration).
        self scheduleDelay: scheduledDelay.
        scheduledDelay := nil ].
    <snip>
    rest of method unchanged


Then I opened the Transcript and in a Playground evaluated...
Delay delaySchedulerClass: DelayMicrosecondScheduler.
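
To generate extra samples while the image is otherwise idle, something like the following
throwaway loop could be forked as well (the counts and durations are arbitrary):

    "Exercise #schedule: / #handleTimerEvent: repeatedly from a few background processes."
    5 timesRepeat: [
        [ 1000 timesRepeat: [ (Delay forMilliseconds: 20) wait ] ] fork ].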


The Transcript results are shown below with some comments inserted.

Now I guess the latency is affected by garbage collection.
But one thing I was curious about is why the latencies were quantised in multiples of 1000 microseconds.

Another interesting thing is that the vast majority of the latencies were zero,
which was a nice surprise, but can it be true?  Or is it a consequence
of the quantisation rounding down?

It seems that the idleness of the image affected how often a non-zero latency occurred.
After I left the house for a while, the count of zero latencies was very high, but a few non-zero ones still occurred.  It would make sense that there was less GC while idle.  What is a good snippet of code to stress the GC?  I presume the latency might increase.
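
(For concreteness, a rough sketch of the kind of stress loop I mean; the counts and sizes are
arbitrary, just enough churn to force scavenges and, via the per-iteration survivor set,
presumably some tenuring and full GCs:)

    "Churn lots of short-lived objects plus a survivor set each iteration, in a background process."
    [ 200 timesRepeat: [
        | survivors |
        survivors := OrderedCollection new.
        50000 timesRepeat: [ survivors add: (Array new: 64) ].
        (Delay forMilliseconds: 5) wait ] ] fork.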


zero latency count 2273 before latency 1000
zero latency count 943 before latency 1000
zero latency count 3666 before latency 1000
zero latency count 1643 before latency 1000
zero latency count 27 before latency 1000
"Left house for 20 minutes"  
zero latency count 12022 before latency 1000
zero latency count 15195 before latency 1000
zero latency count 41998 before latency 1000
"Returned from outing"
zero latency count 128 before latency 1000
zero latency count 116 before latency 1000
zero latency count 555 before latency 1000
zero latency count 2377 before latency 1000
zero latency count 5423 before latency 1000
zero latency count 3178 before latency 1000
zero latency count 47 before latency 1000
zero latency count 2276 before latency 1000
"Left house to go shopping"
zero latency count 6708 before latency 3000
zero latency count 4896 before latency 1000
zero latency count 433 before latency 1000
zero latency count 7106 before latency 1000
zero latency count 2195 before latency 1000
zero latency count 12397 before latency 1000
zero latency count 4815 before latency 2000
zero latency count 3480 before latency 1000
zero latency count 5690 before latency 1000
zero latency count 1670 before latency 1000
zero latency count 320 before latency 1000
zero latency count 1641 before latency 2000
zero latency count 290 before latency 1000
zero latency count 3835 before latency 1000
zero latency count 3818 before latency 1000
zero latency count 1132 before latency 1000
"Returned from shopping."
zero latency count 2045 before latency 1000
zero latency count 671 before latency 1000
zero latency count 167 before latency 1000
zero latency count 147 before latency 1000

From the System Browser, inspecting the class variable LatencyCounts showed...
usec      count
0         171329
1000      49
2000      3
3000      1
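
(For anyone repeating this, one way to dump the distribution sorted, evaluated anywhere
LatencyCounts is in scope, e.g. from a method on DelayMicrosecondScheduler:)

    "Print each distinct latency (usec) with its occurrence count, smallest first."
    LatencyCounts asSet asSortedCollection do: [ :usec |
        Transcript
            crShow: usec printString;
            show: '    ', (LatencyCounts occurrencesOf: usec) printString ].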


cheers -ben


Re: Curious context change latency experiment results (wrt DelayScheduler)

Ben Coman


On 5 May 2018 at 22:10, Ben Coman <[hidden email]> wrote:
<snip full quote of previous message>

P.S. after running a few more hours,  LatencyCounts shows...
usec      count
0         1102691
1000      219
2000      12
3000      29
4000      21
5000      5
49000     1
75000     1


Re: Curious context change latency experiment results (wrt DelayScheduler)

Eliot Miranda-2
Hi Ben,


On May 5, 2018, at 7:25 AM, Ben Coman <[hidden email]> wrote:



<snip full quote of Ben's previous messages>

It's likely that the vm code is using the cheap microsecond clock which is updated by the heartbeat, not the expensive one that interrogates the OS at the instant.  If not, it's the OS.  I'll check soon.

Re: Curious context change latency experiment results (wrt DelayScheduler)

Ben Coman


On 6 May 2018 at 04:20, Eliot Miranda <[hidden email]> wrote:
Hi Ben,

<snip full quote of earlier messages>

It's likely that the vm code is using the cheap microsecond clock which is updated by the heartbeat, not the expensive one that interrogates the OS at the instant.  If not, it's the OS.  I'll check soon.

I went looking and found an obvious source of quantisation: a multiplication by MicrosecondsPerMillisecond
in the Win32 currentUTCMicroseconds(), line 109 here...

Running the same experiment (for 20 mins) on Linux produced a latency distribution in line with what I was originally expecting...
usec        count
1 135
2 992
3 3250
4 21161
5 17309
6 5341
7 1230
8 313
9 79
10 44
11 46
12 50
13 36
14 58
15 73
16 38
17 12
18 7
19 5
20 1
21 3
23 2
24 4
25 4
26 5
27 4
29 3
30 6
31 5
32 6
33 5
34 2
35 2
36 5
37 4
39 4
40 2
41 2
44 3
46 2
47 1
48 4
49 3
51 2
53 1
54 1
56 2
58 1
59 1
61 1
66 2
70 1
82 1
86 1
184 1
544 1
576 1
1212 1
1402 1
2016 1

self average.                        "4.8 microseconds"
self stdev.                          "12.9 microseconds"
self average + (5 * self stdev).     "69.1 microseconds"
 
So there are nine outliers beyond five sigma.
@Clement, would you expect your incremental GC to have much impact on those outliers,
or is its target pause ~1000 usec anyway?

cheers -ben

Re: Curious context change latency experiment results (wrt DelayScheduler)

Eliot Miranda-2
Hi Ben,




_,,,^..^,,,_ (phone)
On May 26, 2018, at 10:46 PM, Ben Coman <[hidden email]> wrote:



<snip full quote of earlier messages>

I went looking and found an obvious source of quantisation: a multiplication by MicrosecondsPerMillisecond
in the Win32 currentUTCMicroseconds(), line 109 here...

When that code was written there was no higher-resolution alternative.  But now I see that there is GetSystemTimePreciseAsFileTime as of Windows 8 / Server 2012.  Volunteers to write in the alternative (checking whether it exists in kernel32.dll, and hence keeping backward compatibility with Windows 7) are welcome.  We could really do with this on Windows.


<snip Ben's Linux latency distribution, quoted from his previous message>
 
So there are nine outliers beyond five sigma.
@Clement, would you expect your incremental GC to have much impact on those outliers,
or is its target pause ~1000 usec anyway?

Not speaking for Clément, I think it would.  But can your measurements correlate the outliers with GC pauses?  Can you extend the experiment to interrogate the GC counts?  For example, maintaining a WeakArray with: Object new, and observing when its element is nil, is a cheap way of observing that a scavenge or GC has occurred, and a pin operation would ensure it was in old space.  So you could allocate a new one to observe scavenges, and pin another one, assigning a fresh Object new to it, to observe full GCs.
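
Something like this sketch might do it; the selector #pinInMemory and the method and variable
names below are guesses, so adjust to whatever pinning API the image exposes:

    "Assumes two extra (hypothetical) instance variables on the scheduler:
     scavengeWitness and fullGCWitness."
    DelayMicrosecondScheduler>>#initializeGCWitnesses
        | pinned |
        "A young, otherwise-unreferenced element: its weak slot is cleared by the next scavenge."
        scavengeWitness := WeakArray with: Object new.
        "A pinned element lives in old space, so its weak slot is cleared only by a full GC."
        pinned := Object new.
        pinned pinInMemory.    "assumed selector"
        fullGCWitness := WeakArray with: pinned

    DelayMicrosecondScheduler>>#gcRanSinceLastSample
        "Answer whether a scavenge or full GC ran since the previous check, then re-arm the witnesses."
        | gcSeen pinned |
        gcSeen := false.
        (scavengeWitness at: 1) ifNil: [
            gcSeen := true.
            scavengeWitness at: 1 put: Object new ].
        (fullGCWitness at: 1) ifNil: [
            gcSeen := true.
            pinned := Object new.
            pinned pinInMemory.
            fullGCWitness at: 1 put: pinned ].
        ^ gcSeen

Recording that answer alongside each latency sample would show whether the outliers coincide with GC activity.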

cheers -ben