Heisenberg strikes again

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Heisenberg strikes again

Schwab,Wilhelm K
Hello all,

I suspect Dr. Heisenberg is having a very good laugh at my expense :)  I
have a state machine of sorts that is, well, out of sorts.  To get a
handle on the problem, I have added some logging, and the fun part is
that to make the logs intelligible, I had to add some critical sections.
  It seems almost guaranteed that the logging will affect the behavior,
if only a small amount.  The logging is optional, and I can direct it to
various streams, including Trace.

On the enhancement front, I found it useful to use Trace's mutex, and
added a method to expose it.  Should that be included in the base system?

The state machine controls serial communications (two way) with an
external device.  It has send and receive semaphores, and they get
signalled under all kinds of crazy conditions - too many, it turns out.
The synchronized log entries for the state machine running against a
disconnected port show that it is piling up excess signals on the read
semaphore.  While I might be able to get away with it, it is not
intended, and really should be fixed.

Any thoughts on using #set vs. #signal?  I suspect I would be better off
to figure out exactly when I need to signal the semphore and "get it
right".  Do you agree?  Or is this a situation that screams for #set?

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


jas
Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

jas
Bill Schwab wrote:
> Hello all,
>
> I suspect Dr. Heisenberg is having a very good laugh at my expense :)  I
> have a state machine of sorts that is, well, out of sorts.  To get a
> handle on the problem, I have added some logging, and the fun part is
> that to make the logs intelligible, I had to add some critical sections.
>  It seems almost guaranteed that the logging will affect the behavior,
> if only a small amount.

As in - the window changes from slightly open
to slightly more than slightly open, and back again.
Fun, isn't it?

Seems every time I go that route, I end up solving
the problem without much help from the (increasingly
complex) logging.

   The logging is optional, and I can direct it to
> various streams, including Trace.
>
> On the enhancement front, I found it useful to use Trace's mutex, and
> added a method to expose it.  Should that be included in the base system?
>
> The state machine controls serial communications (two way) with an
> external device.  It has send and receive semaphores, and they get
> signalled under all kinds of crazy conditions - too many, it turns out.


Seems to me there shouldn't be very many places
that signalling occurs.  I'd guess 1-3 for each
direction, tops.  I suspect refactoring toward
OAOO will help.


> The synchronized log entries for the state machine running against a
> disconnected port show that it is piling up excess signals on the read
> semaphore.  While I might be able to get away with it, it is not
> intended, and really should be fixed.
>
> Any thoughts on using #set vs. #signal?  I suspect I would be better off
> to figure out exactly when I need to signal the semphore and "get it
> right".  Do you agree?  Or is this a situation that screams for #set?


Depends on how you're wired, Bill.
I'd be inclined to figure out what the state machine is,
and either eliminate the extraneous signalling directly,
or recognize it as non-determinism, and convert the NDFA
to an equivalent DFA.  Since #set masks extra signals,
I'd stick with #signal, until I *knew* that #set would
only be masking epsilon edges.

But you could take the other route, and just ensure that
it passes every test case your app cares about.

If you can increase the level of detail, either here,
or in an off-line message, I can take a deeper look.


Regards,

-cstb


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Schwab,Wilhelm K
> Seems every time I go that route, I end up solving
> the problem without much help from the (increasingly
> complex) logging.

In this case, I suspect it has already alerted me to a very important
problem.


> Seems to me there shouldn't be very many places
> that signalling occurs.  I'd guess 1-3 for each
> direction, tops.  I suspect refactoring toward
> OAOO will help.

It is helping as I work toward it.  Ironically, the concern is partly
due to the changes made to date.  Basically, I start these machines and
walk away.  They run for weeks at a time - 49 days and change - until
Delay breaks, or until patches force me to reboot.  During that time,
external devices get unplugged, partially unplugged, powered off and
restarted.  Regardless of unit tests, I am going to be nervous for a few
weeks :(



> Depends on how you're wired, Bill.
> I'd be inclined to figure out what the state machine is,
> and either eliminate the extraneous signalling directly,
> or recognize it as non-determinism, and convert the NDFA
> to an equivalent DFA.  

That's an interesting view of it.


 > Since #set masks extra signals,
> I'd stick with #signal, until I *knew* that #set would
> only be masking epsilon edges.

Fair enough - thanks for confirming my hunch.


> But you could take the other route, and just ensure that
> it passes every test case your app cares about.

Unfortunately, the tests require "random" events in real hardware; it
either runs for weeks at a time under battle conditions or it doesn't.
I have a growing collection of tests that verify the protocols, but the
timing is more difficult to extract into unit tests.



> If you can increase the level of detail, either here,
> or in an off-line message, I can take a deeper look.

Thanks!  I will probably stumble forward with trying to eliminate the
extraneous signals as well as I can, and turn it loose on the world (or
vice versa).  A more careful analysis will take some time, but I will
keep it in mind.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Eliot Miranda
Bill Schwab wrote:

> It is helping as I work toward it.  Ironically, the concern is partly
> due to the changes made to date.  Basically, I start these machines and
> walk away.  They run for weeks at a time - 49 days and change - until
> Delay breaks,

which is due to a 32-bit millisecond clock overflowing.
(2 raisedTo: 32) / (49 * 24 * 60 * 60 * 1000.0) =  1.0145

which is why VW went to a 64-bit microsecond clock...

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Chris Uppal-3
In reply to this post by Schwab,Wilhelm K
Bill,

>  I have added some logging, and the fun part is
> that to make the logs intelligible, I had to add some critical sections.
>   It seems almost guaranteed that the logging will affect the behavior,
> if only a small amount.  The logging is optional, and I can direct it to
> various streams, including Trace.

I can't help with your design problem with semaphores, but one way of avoiding
the Heisenberg effect might be to use Process-specific log streams.  There's a
"spare" slot in the definition of Process, called "tls" (I can't /guess/ what
that might stand for ;-) which you could hijack for your purposes (I've done so
before...).  The log streams would necessarily go to file(s), but if you
timestamp the data you should be able to reassemble the sequence of events
mechanically.

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Chris Uppal-3
In reply to this post by Eliot Miranda
Eliot Miranda wrote:

> which is why VW went to a 64-bit microsecond clock...

That's merely putting the problem off for a few hundred million years.  Bill
wants his machines to be /reliable/...

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Eliot Miranda
Chris Uppal wrote:
> Eliot Miranda wrote:
>
>
>>which is why VW went to a 64-bit microsecond clock...
>
>
> That's merely putting the problem off for a few hundred million years.  Bill
> wants his machines to be /reliable/...

:)

But you're a little optimistic...

(2 raisedTo: 64) / (365.5 * 24 * 3600 * 1000000.0) - (2005 - 1901)
= 584038.0

:)

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Schwab,Wilhelm K
In reply to this post by Chris Uppal-3
Chris,

>>which is why VW went to a 64-bit microsecond clock...
>
> That's merely putting the problem off for a few hundred million years.  Bill
> wants his machines to be /reliable/...

I certainly understand OA's caution; a rush to fix it could be
disasterous.  However, it is somewhat unusual to have somebody other
than Microsoft as a limiting factor in reliability.

Here is a thought: machines seem to know the date and time they started,
even beyond the 49 day cutoff.  Is there some way we could tap that info
and create a long integer time stamp that is accurate enough to make
delays work?

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Schwab,Wilhelm K
In reply to this post by Chris Uppal-3
Chris,

> I can't help with your design problem with semaphores, but one way of avoiding
> the Heisenberg effect might be to use Process-specific log streams.  There's a
> "spare" slot in the definition of Process, called "tls" (I can't /guess/ what
> that might stand for ;-) which you could hijack for your purposes (I've done so
> before...).  The log streams would necessarily go to file(s), but if you
> timestamp the data you should be able to reassemble the sequence of events
> mechanically.

That is a great idea.  Actually, I think I could collect data into write
streams over strings, and then output the whole mess after the fact
based on time stamps.

I think the problem is resolving itself (one of the semaphores is
probably not necessary!), but I will keep this in mind for this or
future problems.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Randy Coulman-2
In reply to this post by Chris Uppal-3
Chris Uppal wrote:
>
> I can't help with your design problem with semaphores, but one way of avoiding
> the Heisenberg effect might be to use Process-specific log streams.  There's a
> "spare" slot in the definition of Process, called "tls" (I can't /guess/ what
> that might stand for ;-) which you could hijack for your purposes (I've done so
> before...).  The log streams would necessarily go to file(s), but if you
> timestamp the data you should be able to reassemble the sequence of events
> mechanically.
>

TLS usually stands for thread-local storage, which seems appropriate in
this context.

Randy
--
Randy Coulman
NOTE: Reply-to: address is spam-guarded.  Reassemble the following to
reply directly:
rcoulman at charter dot net


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Schwab,Wilhelm K
In reply to this post by Schwab,Wilhelm K
Chris,

> I think the problem is resolving itself (one of the semaphores is
> probably not necessary!), but I will keep this in mind for this or
> future problems.

So much for that theory - I found the reason for the "extra" semaphore.
  However, I now better understand why it was useful, and can probably
arrive at a much simpler design that is sufficiently flexible.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Chris Uppal-3
In reply to this post by Eliot Miranda
Eliot Miranda wrote:

> > > which is why VW went to a 64-bit microsecond clock...
> >
> >
> > That's merely putting the problem off for a few hundred million years.
> > Bill wants his machines to be /reliable/...
>
> :)
>
> But you're a little optimistic...
>
> (2 raisedTo: 64) / (365.5 * 24 * 3600 * 1000000.0) - (2005 - 1901)
> = 584038.0

Ah, you said /micro/second clock, I missed that bit.

Dolphin does have a microsecond clock (only if the hardware's available -- or
so MSDN warns), but that's not used by the Delay class :-(

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Chris Uppal-3
In reply to this post by Schwab,Wilhelm K
Bill,

> Is there some way we could tap that info
> and create a long integer time stamp that is accurate enough to make
> delays work?

If I'm reading the code for Delay correctly, then the effect of a clock
wraparound is to make the Delay wake up early.  If so, then I imagine that it
would be easy (conceptually) to create a LongDelay which knew when (date and
time) it should wake up, and kept doing:

    (Delay forSeconds: <whatever>) wait

until the desired time had arrived (or passed).  It would simply continue the
loop if a delay expired too soon.

One unwanted complication is that it would probably have to handle daylight
saving changes.  At least that would be true for the general case -- I don't
know whether you need that feature.

I started to hack something together, but was reminded again of what a dreadful
state Dolphin's time and date classes are in (none of Date, Time, TimeStamp,
SYSTEMTIME, FILETIME, and LocalFILETIME work together properly, and between the
lot of them don't seem to provide a complete feature-set), and gave it up as
No Fun At All.  Still not difficult, though, just messy...

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Schwab,Wilhelm K
Chris,

> If I'm reading the code for Delay correctly, then the effect of a clock
> wraparound is to make the Delay wake up early.  

I respectfully doubt that is the case, only because it seems
inconsistent with the behavior of the affected systems.  Their mission
is to wake up every 30 seconds or so, and they fail to wake up after the
wrap.  Of course, sometimes they did wake up (which made the bug even
harder to find) if they managed to not create a critical delay that
spanned the wrap.

Technical babbling aside, it seems that if the delays did wake up early,
then the machines would simply fire off another 30 second work cycle a
little earlier than expected, and my pager wouldn't go off.  Of course
now that I understand the bug, I reboot before it happens - most of the
time ~:0   The hard part is explaining to somebody that they suffered
because I couldn't count to seven :(



 > If so, then I imagine that it
> would be easy (conceptually) to create a LongDelay which knew when (date and
> time) it should wake up, and kept doing:
>
>     (Delay forSeconds: <whatever>) wait
>
> until the desired time had arrived (or passed).  It would simply continue the
> loop if a delay expired too soon.

If I understood correctly, somebody recently did something similar for
Squeak.



> One unwanted complication is that it would probably have to handle daylight
> saving changes.  At least that would be true for the general case -- I don't
> know whether you need that feature.

For the most heavily affected systems, I do not allow them to make the
change w/o "permission".  We pick one time zone and stick with it for
the duration of surgery.  I have some servers for which time is less
critical, and I do allow them to change on their own; I could easily
treat them the same way.


> I started to hack something together, but was reminded again of what a dreadful
> state Dolphin's time and date classes are in (none of Date, Time, TimeStamp,
> SYSTEMTIME, FILETIME, and LocalFILETIME work together properly, and between the
> lot of them don't seem to provide a complete feature-set), and gave it up as
> No Fun At All.  Still not difficult, though, just messy...

Odd you mention this now.  Just days ago, I created a #fromMilliseconds:
  for TimeStamp (or something like that - too lazy to check right now).
  It seemed to work but needs some tests.  Anybody interested in it?

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Heisenberg strikes again

Chris Uppal-3
Bill,

> > If I'm reading the code for Delay correctly, then the effect of a clock
> > wraparound is to make the Delay wake up early.
>
> I respectfully doubt that is the case, only because it seems
> inconsistent with the behavior of the affected systems.

Ah, sorry.  I was only guessing at what would happen if
ProcessorScheduler>>signal:afterMilliseconds: were passed a negative value.
Looking at it, it seems that it signals the semaphore immediately if the
milliseconds count is a /small/ negative number, but that the primitive fails
with a walkback if the delay is something like,

    (2**31) negated

as it would be (I think) when the clock wraps.  I don't know, but maybe for
your purposes as simple a fix as changing Delay>>getDuration so that it never
returns less than 0 (or less than -1) would avoid the problem ?  That should (I
think) make wrapped delays expire immediately which you said wouldn't cause you
any particular problems (and is reasonably easy to code around if necessary).

Difficult to test, though...


> Odd you mention this now.  Just days ago, I created a #fromMilliseconds:
>   for TimeStamp (or something like that - too lazy to check right now).
>   It seemed to work but needs some tests.  Anybody interested in it?

I've pretty much settled on using FILETIME for everything these days.  BTW, for
anyone interested, don't forget that Ian's goodies include an implementation of
the ANSI DateAndTime stuff.  I haven't used it myself (I'd forgotten all about
it, to he honest) but it looks good.

    -- chris