VM stability / unit tests


VM stability / unit tests

Phil B
 
While I've been enjoying the fantastic performance improvements we've seen from Cog onward, one thing I've been less excited about is some of the stability/functionality issues I've been running into.  They are not numerous (maybe half a dozen or so major ones in the last 5 years), but they are getting quite tedious to isolate and replicate.  Recent examples that come to mind include the 64-bit primHighResClock truncation and 'could not grow remembered set' issues.  (My current joy is a case where I have an #ifTrue: block that doesn't get executed unless I convert it to an #ifTrue:ifFalse: with a no-op for the ifFalse:. I'll provide a reproducible test case as soon as I'm able.  The specific issue isn't the issue, but rather that I keep hitting things like this that seem rather fundamental yet edge-casey at the same time.)
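To give a rough idea of the shape of what I'm seeing (this is a hypothetical sketch with placeholder names, not my actual failing code):

```smalltalk
"Hypothetical sketch only -- not the real test case.
someCondition is a placeholder for whatever predicate is involved."
| flag result |
flag := self someCondition.
result := flag ifTrue: [ #ran ].    "this block is silently not executed"

"...yet the equivalent two-armed form behaves as expected:"
result := flag
    ifTrue: [ #ran ]
    ifFalse: [ nil ].               "no-op arm added as a workaround"
```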

I don't expect perfection, as a phenomenal amount of progress is being made by a small group of people, but I am beginning to wonder whether the existing unit tests are sufficient to adequately exercise the VM, i.e. so that the VM developers are aware when a recent change may have broken something, or whether the existing tests are mainly oriented towards image and bytecode VM development.  Just some food for thought; I also wanted to see if it's just me having these sorts of issues...

Thanks,
Phil

Re: VM stability / unit tests

Nicolas Cellier
 
Hi Phil,

That's probably right: there is a lack of smoke tests.

There are several hurdles before we reach today's state of the art with respect to continuous delivery and regression testing:

- 1) the artifacts must be built, or we won't even have a chance to run tests.
  We can observe that they have been broken too many times by all sorts of problems, and a green status is the exception in an ocean of red.
  Problems encountered so far include:
  * work in progress in the core VM or plugins
  * wrong configuration of Pharo target directories or credentials
    (this was the case for most of 2017 but is fortunately fixed now)
  * stale or intermittent links (URLs):
    for example, the build loads things from the network (like cygwin updates)
    and that sometimes fails
  * failure to build a library due to tool changes at AppVeyor/Travis

The introduction of new bugs could be prevented if the feedback were correct (no false alarms).
But that has not really been the case so far (lots of noise).

- 2) we are chasing too many hares at once, that is, a combination of:
  v3/stack/spur, i386/x86_64/ARM, Linux/Windows/MacOS, sista/lowcode, Squeak/Pharo/Newspeak.
  I certainly forgot threaded FFI in the above list, plus the register-efficient JIT variants...

Again, breaking a single one of these configurations leads to a RED status.
Not all these configurations are at the same level
- of importance (fewer users, not used in production, ...)
- of maturity (in progress, experimental, or in production).
So we must find a way to prioritize and focus on production artifacts...

- 3) we need a stable image side for running smoke tests.
But we need some image-side changes for some new features, which prevents running older versions.
Squeak and Pharo still have randomly failing tests (network-dependent ones, etc.).
Someone has to do the work (or pay for it)...
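For example, a network-dependent test could at least be guarded so that it is skipped rather than reported as a random failure. A sketch (#skip: is the Pharo SUnit idiom; #isNetworkAvailable is an imaginary helper such a test class would need to implement):

```smalltalk
"Hypothetical sketch: guard a flaky network-dependent SUnit test.
#isNetworkAvailable is an imaginary helper, not an existing API."
testResolveHostName
    self isNetworkAvailable
        ifFalse: [ ^ self skip: 'no network available; not a VM bug' ].
    self assert: (NetNameResolver addressForName: 'localhost') notNil
```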

- 4) build status feedback is very sloowww:
  * as said above, we build too many configurations
  * Pharo has introduced a lot of dependencies on external libraries;
    this leads to either long build times, or the use of caches that delay the detection of new failures

We all know that dev branches (feature branches) help a lot with some of the above problems.
But we have these additional hurdles:
- feature branches work well when cycles are short,
  but core VM cycles are not short (3 to 6 months or more for introducing a new GC, 64 bits, minimal SISTA, ...).
  A lot of the changes required for SISTA, 64 bits and the JIT variants are competing, and parallel branches would create conflicts and would not work without regular syncing.
  That explains why all the branches are gathered into a giant and complex one today...
- versioning generated code is a recipe for creating unsolvable (unmergeable) conflicts.
  It's still possible to version the generated code for a plugin (if work on it is not concurrent),
  but this prevents working in parallel branches as soon as the core generation is changed in VMMaker.

In recent posts, I saw brilliant young people under-estimating a bit the work involved and the complexity of the task.
Fabio has done tremendous work to restore the green status, and the help of Esteban has been decisive in this respect.
We will never thank them enough for that.

But maybe current state is at the limit of sustainability.
And maybe it's time to drop some drag.



2018-03-30 22:35 GMT+02:00 Phil B <[hidden email]>:
 



Re: VM stability / unit tests

Eliot Miranda-2
In reply to this post by Phil B
 
Hi Phil,

> On Mar 30, 2018, at 1:35 PM, Phil B <[hidden email]> wrote:

Part of the problem is in creating test frameworks that are stable enough and complex enough.  It's a lot of work.  Consider the most unstable part of Spur for the past year, the new compactor, which took a year to become fully reliable (touch wood).  The last case that showed the last bug I fixed required a really large image, a snapshot and a load of that snapshot followed by a GC to show the bug.

In fact what this shows is that writing regression tests is easy, but writing adequate stress tests is hard.  In my experience it's more effective to let the community provide the stress tests and to try to be as responsive as possible in fixing bugs as soon as they appear.  So having knowledge of how to create reproducible cases, knowing the right channel to report a bug, etc., is important.
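To give the flavour, a community-provided stress case for something like the compactor can be as crude as this (a sketch only; Smalltalk garbageCollect is real Squeak/Pharo, the sizes are arbitrary):

```smalltalk
"Crude GC/compactor stress sketch: churn through short-lived
allocations, then force full GCs so compaction runs repeatedly.
The iteration counts and object sizes are arbitrary choices."
| junk |
50 timesRepeat: [
    junk := (1 to: 200000) collect: [ :i | Array new: 8 ].
    junk := nil.
    Smalltalk garbageCollect ]
```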

And if I'm right here then this points to the need for a workflow where VMs are built and tested automatically from tip.  I don't properly understand the issue, but I'm frustrated that the current Pharo vm is way behind that compactor bug fix.  I think the issue is that the Pharo vm has more than one tip; it has the execution engine/GC/FFI tip that Clément, Nicolas and I take responsibility for, and then there's the various library extensions (for git, fonts, imaging) that is a significant weight on Esteban's shoulders, and then there's SSL support from Tobias, etc.

So perhaps we need a two tier VM code base, so we can decouple these various tips and advance each tip to "the stable branch" when appropriate.  That in turn requires a CI infrastructure which allows developers of each tip to test their changes in the context of an otherwise stable code base.

