Tests and software process


Tests and software process

Ralph Johnson
Squeak comes with a large set of SUnit tests.  Unfortunately, some of
them don't work.  As far as I can tell, there is NO recent version of
Squeak in which all the tests work.

This is a sign that something is wrong.  The main purpose of shipping
tests with code is so that people making changes can tell when they
break things.  If the tests don't work then people will not run them.
If they don't run the tests then the tests are useless.  The current
set of tests is useless because of the bad tests.  Nobody complains
about them, which tells me that nobody runs them.  So, it is all a
waste of time.

If the tests worked then it would be easy to make a new version.
Every bug fix would have to come with a test that illustrates the bug
and shows that it has been fixed.  The group that makes a new version
would check that all tests continue to work after the bug fix.

An easy way to make all the tests run is to delete the ones that don't
work.  There are thousands of working tests and, depending on the
version, dozens of non-working tests.  Perhaps the non-working tests
indicate bugs, perhaps they indicate bad tests.  It seems a shame to
delete tests that are illustrating bugs.  But if these tests don't
work, they keep the other tests from being useful.  Programmers need
to know that all the tests worked in the virgin image, and that if the
tests quit working, it is their own fault.

No development image should ever be shipped with any failing tests.

-Ralph Johnson


Re: Tests and software process

Daniel Vainsencher-6
Hi Ralph,

Of course you're right, this has been an issue for quite a while. I
think the problem is that tests have diverse domains of validity, and
there are neither abstractions nor infrastructure in place to support them.

In theory (and often in practice) you run "the" test suite every few
minutes, and a test fails iff some code is broken. Wonderful!
Unfortunately, in a large-scale, distributed, diverse effort like
Squeak, things are more complicated.

Examples:
- Platform specific tests.
- Very long running tests, which for most people don't give enough value
for their machine time.
- Non-self-contained tests, for example ones that require external files
to be present.
- Performance tests (only valid on reasonably fast machines. And this
might change over time...)

All of these do have some value in some context, but some cannot be
expected to be always green, and some aren't even worth running most of
the time. And the problem is that our choice about "where/when should
this test run" is currently binary - everywhere, or nowhere. You say we
should be more aggressive in making this binary decision, but the reason
this isn't happening is that sometimes neither option is quite right.

The community has gone back and forth on extracting some or all tests
into an optional package, but in practice that just means they never get
run.

Do you know of some set of abstractions/practices/framework to deal with
this problem?

Daniel Vainsencher




Re: Tests and software process

Howard Stearns
Daniel Vainsencher wrote:

> ...
> unfortunately, in a large scale, distributed, diverse effort like
> Squeak, things are more complicated.
>
> Examples:
> ...And the problem is that our current choice about "where/when
> should this test run" is currently binary - everywhere, or nowhere. ...
>
> Do you know of some set of abstractions/practices/framework to deal with
> this problem?

In the Lisp community (which is also based on image+package), the
abstraction for a software package (called a "system") encompasses
version, dependencies, and operation, where operation is generally
considered to include test.

A system is defined as a set of modules with a type tag, including
"test", as well as "foreign libraries", "documentation", Lisp "source
code", and other systems recursively. Modules define other metadata,
including various kinds of dependencies.

You perform an operation on a system, such as "load" or "test". The
machinery collects all the dependencies based on the operation and any
operations that the specified operation requires. This collection is
based on knowledge of what has already been successfully performed in
the current running image and is still valid (e.g., that source
hasn't changed). The resulting partial orderings of dependencies are
then topologically sorted to produce a total ordering of operations on
modules.
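
To make the ordering idea concrete, here is a rough Squeak workspace
sketch (mine, not taken from any actual defsystem; the module names are
made up) of turning declared dependencies into one total load/test order:

| deps visited order visit |
deps := Dictionary new.
deps at: #Tests put: #(Source Core).
deps at: #Source put: #(Core).
deps at: #Core put: #().
visited := Set new.
order := OrderedCollection new.
visit := [:module |
    (visited includes: module) ifFalse: [
        visited add: module.
        (deps at: module ifAbsent: [#()]) do: [:dep | visit value: dep].
        order add: module]].
deps keysDo: [:module | visit value: module].
Transcript show: order printString; cr.
"dependencies always come before their dependents, e.g. #Core, #Source, #Tests"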

The more general of these system tools allow developers to define their
own operation and module types, without having to re-engineer the system
tools themselves.

The result is that developers can pretty readily test any combination
of systems in a meaningful way.

The code for doing all this was really quite small and understandable. I
was part of a group who used it for planning manufacturing operations in
a factory.

Alas, every Lisp organization and nearly every programmer has written
his own version of this general mechanism, so no standard emerged (as of
my last experience with this, circa '99). I don't know whether this
means that the model wasn't quite right, or that Lisp programmers are
perverse.

References:
http://www.google.com/search?q=lisp+defsystem
http://www.google.com/search?q=lisp+define-system
http://www.google.com/search?q=lisp+waters+regression+test

--
Howard Stearns
University of Wisconsin - Madison
Division of Information Technology
voice:+1-608-262-3724


Re: Tests and software process

Diego Fernández
In reply to this post by Daniel Vainsencher-6
On 11/1/06, Daniel Vainsencher <[hidden email]> wrote:
> Do you know of some set of abstractions/practices/framework to deal with
> this problem?

Yes. TestSuites can be used to group tests.
That's what I was trying to say in:
http://lists.squeakfoundation.org/pipermail/squeak-dev/2006-October/110461.html

...but I think that mail was lost among all the mail that comes to the
list :(
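
For example, here is a minimal workspace sketch of that kind of grouping
(the two test class names are placeholders, not classes in the image):

| suite result |
suite := TestSuite new.
suite addTest: MyParserTest suite.    "placeholder TestCase subclass"
suite addTest: MyPrinterTest suite.   "placeholder TestCase subclass"
result := suite run.
Transcript show: result printString; cr.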


Re: Tests and software process

Ralph Johnson
In reply to this post by Daniel Vainsencher-6
On 11/1/06, Daniel Vainsencher <[hidden email]> wrote:
> Do you know of some set of abstractions/practices/framework to deal with
> this problem?

Sure.  Mostly practices.  I think we have enough abstractions and
frameworks already, though they could be better.

Divide tests into the ones that you expect to be run by people
developing other packages, and those run by developers of your
package.  The first are going to be included with your package, and
the second will be in a separate package in MC.

There is a base image that includes some set of packages.  Other
packages are said to "work with the base image".  The tests in the
base image all run on all platforms.  There might be platform specific
tests, but they are not in the base image.  If a package works with
the base image then when you load it, all of the original tests in the
image will work, and the tests with the package will work.
Presumably, the private tests for the package will work, too, but that
is up to the developer of the package.

Of course, just because two packages work with the base image does not
mean that they will work with each other.  It makes sense to have a
"universe" in which any combination of packages in the universe will
work with each other, as well as with the base image.  This takes more
testing and certification, and there has to be a "universe maintainer"
who does this.  In theory it is easy, in practice it is a lot of work.
But much of the work can be automated.  We can worry about this after
we have a base image in which all tests work.  The first priority is
to create a world in which developers can assume that any broken tests
are their fault.

The current set of tests in Squeak is not too bad.  I think they will
take on the order of half an hour on a fast machine, so it is possible
for a developer to run them all before releasing code.  People do not
run all the tests every time they make a little change, no matter what
the books say.  People tend to pick the most relevant test suites and
run them after each little change, so those tests will run in  just a
few minutes.  They don't run long tests very often.  So, I do not
think that speed is the problem.

The problem is that all the tests in a release image should work.
Period.  Either fix them or remove them.  From then on, if someone
offers a patch and the patch breaks some tests, reject the patch.
Never release an image with broken tests.  Don't accept code that
breaks tests unless you are trying to help the author and plan to fix
the tests yourself.

-Ralph


[ANN] ICal occurrence API

J J-6
In reply to this post by Diego Fernández
Hello all,

I just published a new version of ICal (well a couple, ignore the first
jbjohns) that now supports querying what occurrences of an event are
described by a recurrence rule.  The public API consists of the following 6
methods.

ICEvent>>occurrences
ICEvent>>occurrencesAfter: aTimeSpan
ICEvent>>occurrencesBetween: aStartTimeSpan and: anEndTimeSpan
ICEvent>>occurrences: aNumber
ICEvent>>occurrences: aNumber after: aTimeSpan
ICEvent>>occurrences: aNumber between: aStartTimeSpan and: anEndTimeSpan

The first two require the rule to have a count or until directive, since
otherwise the set would be infinite (I will consider infinite sets later :)
).  All methods are constrained by the rule (i.e. if the rule has a count
directive of 4 then occurrences: 6 will still return only 4).  The
ICEvent>>isValidForDate: method was also changed, so that it checks if the
given date is in the set.

Things to be aware of:
- Right now the API only works for monthly recurrence rules, but I plan to
put in more soon (I will be focusing on Weekly and above).  The rest will
spit out some "does not understand" messages for the occurrence methods,
but otherwise, everything works as before.
- Right now the occurrence methods just return an ordered list of dates.  I
haven't decided yet what should be returned (just a DateAndTime, or maybe a
complete event representing that day?) so I have just deferred for now.  Let
me know what would be the most useful to you.
- The classes won't change and the API listed above won't change, but the
methods in the ICFrequency classes will be moved around some.
- BUG: If your ICEvent uses multiple rules and they have common dates between
them, they will all be in the returned set (i.e. there can be multiples of
the same date).  I am thinking of using some other data structure than
OrderedCollection to fix this problem, and to remove the need for sorting to
happen in various spots throughout.
- ExclusionRules are not handled at the moment.
- The methods are designed for TimeSpan resolution, but right now some of the
lower-level methods work on Dates.  This should mostly be transparent,
except that the sets returned are Dates instead of DateAndTimes like they
should be. :)

Thanks.  Hope this is useful to someone. :)
Jason




Re: Tests and software process

Jason Rogers-4
In reply to this post by Ralph Johnson
On 11/1/06, Ralph Johnson <[hidden email]> wrote:
> No development image should ever be shipped with any failing tests.

... except if those tests are really customer/functional tests.  They
aren't expected to run "green" 100% of the time.  I don't know if that
is the case with these tests (I suspect that it is not).  What would
be useful is to package up tests into real Unit/Developer tests (as
has been suggested) and functional tests.

--
Jason Rogers

"Where there is no vision, the people perish..."
    Proverbs 29:18


Re: Tests and software process

keith1y
In reply to this post by Ralph Johnson
Daniel wrote:

> I think the problem is that tests have diverse domains of validity,
> and there are neither abstractions nor infrastructure in place to
> support them.

> Examples:
> - Platform specific tests.
> - Very long running tests, which for most people don't give enough
> value for their machine time.
> - Non-self-contained tests, for example ones that require external
> files to be present.
> - Performance tests (only valid on reasonably fast machines. And this
> might change over time...)
> ...

> Do you know of some set of abstractions/practices/framework to deal
> with this problem?
Yes! Or at least a step in the right direction  ;-)

SSpec 0.13, which I have recently ported to Squeak, is hardwired to
define a suite of tests using the method category 'specs', whereas, as
you know, SUnit effectively hardwires the definition of a suite of tests
by methods beginning with 'test*'.

In order to integrate the two with the same TestRunner I have taken
steps toward a more generic and flexible solution.

For many years I have been adding code to SUnit that allows a TestCase
to publish a number of different test suites for different contexts as
suggested above.

I have now added this feature to SUnit and extended it, in the hope that
it may be adopted in 3.9+. I have also extended TestRunner to provide a UI
for selecting which published suite(s) to run.

All that is needed is to define some conventions for naming suites that
the community may find useful, including I presume: tests that should
always pass in an image release, tests that are specific to a particular
release, tests that illustrate bugs to be addressed, tests that
highlight when certain fixes have not been loaded or the external
environment is/is not as expected, long tests, and test suites
associated with particular products, or specialist releases.

I have also taken the liberty of reorganizing the Class categories from
SUnit in order to integrate more nicely with SSpec.

I adopted the following Top level categories which I put forward as a
suggestion for 3.10 if others are agreeable.

Testing-SUnit
Testing-SSpec
Testing-Common
Testing Tests-SUnit
Testing Tests-SSpec

How the scheme works: a TestCase (or SpecContext) class defines
#publishedSuites such that

TestCase-c-#publishedSuites
    ^ #(#allStandardTests #longTests #squeak39release #knownBugs #allStandardAndLongTests)

Each of the nominated published suites defines a match string which
selects the tests in that suite. The match string supports '|' for 'or'
so as to support multiple matches. Also, the match string matches
against both the method name and the method category name together:

<method match string>@<category match string>


TestCase-c-#allStandardTests
    ^ '*test|*@tests*'

TestCase-c-#longTests
    ^ '*longtest'

TestCase-c-#allStandardAndLongTests
    ^ 'test*|longTest*'

The new API for building suites is based upon #suite: , e.g. (myTestCase
suite: '*@mytests') would return a test suite consisting of all the test
methods in the category 'mytests'. The TestRunner can build a single
suite across multiple classes by using (TestCase-c-#suite: <match>
addTo: <suite>) together with the information gathered from
#publishedSuites.
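
As a rough illustration of the matching idea (just a workspace sketch,
not the code in the Testing package): each '|' alternative is split on
'@' into a selector pattern and a category pattern, and both must match.

| pattern selector category matches |
pattern := '*test|*@tests*'.
selector := 'testFoo'.
category := 'tests - parsing'.
matches := ((pattern findTokens: '|') detect: [:alternative |
        | parts selectorPattern categoryPattern |
        parts := alternative findTokens: '@'.
        selectorPattern := parts first.
        categoryPattern := parts size > 1 ifTrue: [parts second] ifFalse: ['*'].
        (selectorPattern match: selector) and: [categoryPattern match: category]]
    ifNone: [nil]) notNil.
Transcript show: matches printString; cr.    "true - the '*@tests*' alternative matches"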

You will find the code and the test runner in

http://www.squeaksource.com/Testing

I haven't finished the spec integration with TestRunner, though SSpec
can be used with the TestRunner.

enjoy, and do let me know what you think.

Keith






RE: Tests and software process

J J-6
In reply to this post by Ralph Johnson
+1


>From: "Ralph Johnson" <[hidden email]>
>Reply-To: The general-purpose Squeak developers
>list<[hidden email]>
>To: "The general-purpose Squeak developers
>list"<[hidden email]>
>Subject: Tests and software process
>Date: Wed, 1 Nov 2006 07:30:30 -0600
>
>Squeak comes with a large set of SUnit tests.  Unfortunately, some of
>them don't work.  As far as I can tell, there is NO recent version of
>Squak in which all the tests work.
>
>This is a sign that something is wrong.  The main purpose of shipping
>tests with code is so that people making changes can tell when they
>break things.  If the tests don't work then people will not run them.
>If they don't run the tests then the tests are useless.  The current
>set of tests are useless because of the bad tests.  Nobody complains
>about them, which tells me that nobody runs them.  So, it is all a
>waste of time.
>
>If the tests worked then it would be easy to make a new version.
>Every bug fix would have to come with a test that illustrates the bug
>and shows that it has been fixed.  The group that makes a new version
>would check that all tests continue to work after the bug fix.
>
>An easy way to make all the tests run is to delete the ones that don't
>work.  There are thousands of working tests and, depending on the
>version, dozens of non-working tests.  Perhaps the non-working tests
>indicate bugs, perhaps they indicate bad tests.  It seems a shame to
>delete tests that are illustrating bugs.  But if these tests don't
>work, they keep the other tests from being useful.  Programmers need
>to know that all the tests worked in the virgin image, and that if the
>tests quit working, it is there own fault.
>
>No development image should ever be shipped with any failing tests.
>
>-Ralph Johnson
>




Re: Tests and software process

keith1y
In reply to this post by keith1y
Errata in previous message:

>
> The new api for building suites is based upon #suite: . e.g.
> (myTestCase suite: '*@mytests') would return a test suite consisting
> of all the test methods in the category 'mytests'.
I had forgotten that the most recent incarnation of this API already
works in conjunction with #publishedSuites. So you obtain a suite with a
call to #suite:, supplying the published suite selector.

myTestCase suite: #allStandardTests

or

myTestCase suite: #longTests

best regards

Keith


Re: Tests and software process

keith1y
In reply to this post by J J-6


> No development image should ever be shipped with any failing tests.
>
> -Ralph Johnson

-1

No development image should ever be shipped with any failing tests without some context to explain why they are failing.

There may be tests that are intended to fail, particularly those that are intended to validate the external environment for a product. Of course the most annoying case is the example failure-raising test case in SUnit that is placed there for beginners to see what a failure looks like.

So, better to say that no development image should ever be shipped with any failing tests that are associated with the domain of 'ensuring that the development image works as expected'. There may be other domains, such as 'known bugs still to be fixed'.

Keith



Re: Tests and software process

Ralph Johnson
On 11/1/06, Keith Hodges <[hidden email]> wrote:

>  No development image should ever be shipped with any failing tests.
>
>  -Ralph Johnson
>  -1
>
>  No development image should ever be shipped with any failing tests without
> some context to explain why they are failing.
>
> There may be tests that are intended to fail, particularly those that are
> intended to validate the external environment for a product. Of course the
> most annoying case being the example failure raising test case in SUnit that
> is placed there for beginners to see what a failure looks like.
>  So, better to say, that no development image should ever be shipped with
> any failing tests that are associated with the domain of 'ensuring that the
> development image works as expected'. There may be other domains such as
> 'known bugs still to be fixed'.

The first thing I do when I start working with a new image is to
delete the SUnit tests.  The one that always fails is especially
annoying.  It is useful for people porting SUnit to a new platform,
but it is not useful to most people.

"Known bugs still to be fixed" should be a separate package that you
can load if you are going to fix the bugs.  There should not be any
SUnit tests like that in a released, stable image.  Not even in an
alpha or a beta image.

One of the main purposes of tests is to let you know when you broke
something.  They will not have this function as long as some of them
fail.  If tests fail then either 1) delete them 2) fix the code so
they no longer fail or 3) move them to a package on MC, or anywhere
not in the image.

-Ralph Johnson


Re: Tests and software process

keith1y

> "Known bugs still to be fixed" should be a separate package that you
> can load if you are going to fix the bugs.  There should not be any
> SUnit tests like that in a released, stable image.  Not even in an
> alpha or a beta image.
This might be an ideal, but I don't think that it is practical, since an
individual bug-fix test may well (and should) exist in the context of the
other tests for that package/set of functionality. Making a separate
change-set or package for such a test just seems too much. Once the fix
is made, the individual bug-test method from the separate
bugs-to-fix package would then have to be integrated back, and so on.

I would prefer to have the full information available to me in order to
assess the state of an image, whether declared stable or not. For me the
50 or so tests that fail in the 3.9 image would be ok if they were in a
"tests we expect to fail" category.

With the scheme I propose, you select the "tests for release 3.10"
category of tests and hit run; if all the tests pass, then great. The
existence of miscellaneous snippets of code in the image that are not
part of that category is simply an ignorable artefact for the purpose of
validating that release. Those snippets might be named #bugtestMyBug or
#release39Test or #extraLongTest.

my 2p

Keith



       
       
               


Re: Tests and software process

Hans-Martin Mosner
In reply to this post by Ralph Johnson
Ralph Johnson wrote:
> Squeak comes with a large set of SUnit tests.  Unfortunately, some of
> them don't work.  As far as I can tell, there is NO recent version of
> Squeak in which all the tests work.
>
> This is a sign that something is wrong.  
Yup. To strengthen the upcoming trend of "do something" I have
investigated all the failing test cases in a 3.9-RC3-7066 image. The
results are at http://wiki.squeak.org/5889 - please feel free to comment.

Incidentally, there are very few classes of problems which are
responsible for most failing cases:

One (which causes half of the failures and errors) is missing features
in the MVC implementation of ToolBuilder. In my opinion, the
MVCToolBuilderTests should simply clamp these down by overriding, with
empty methods, the test cases that cannot possibly work in MVC.
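
Something along these lines (a sketch only; the selector here is made up,
not one of the actual failing tests):

MVCToolBuilderTests
    compile: 'testSomeMorphicOnlyFeature
        "Cannot work under the MVC ToolBuilder; intentionally left empty."'
    classified: 'tests - overridden'.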

Then there are a number of FloatMathPlugin tests which just test whether
a sequence of floating point operations on a huge number of pseudorandom
floats exactly yields a specified result. In one case, the result on my
machine is not equal to the result specified in the test, but in more
cases the pseudorandom inputs are simply not applicable to the
mathematical functions under test. This indicates a problem with the
test and not with the plugin.

There are a small number of SqueakMap and Monticello tests which I don't
understand. These should be checked by the developers.

One test (ReleaseTest>>#testUnimplementedNonPrimitiveCalls) should
simply not be a unit test. This is a lint test which may be valuable as
far as it concerns your own code, but unless we want a very rigid
release regime this does not make sense here.

What's left is a very short list of genuine bugs. Some are simple to
fix, others probably require intensive debugging.
Expect less than 10 failing unit tests in 3.9 by the end of this week.

Cheers,
Hans-Martin


Re: Tests and software process

keith1y
 
>> Squeak comes with a large set of SUnit tests.  Unfortunately, some of
>> them don't work.  As far as I can tell, there is NO recent version of
>> Squeak in which all the tests work.
>>
>> This is a sign that something is wrong.  
>>    
agreed.

> Yup. To strengthen the upcoming trend of "do something" I have
> investigated all the failing test cases in a 3.9-RC3-7066 image. The
> results are at http://wiki.squeak.org/5889 - please feel free to comment.
>  
> Expect less than 10 failing unit tests in 3.9 by the end of this week.
>
> Cheers,
> Hans-Martin
>
>  
fantastic

Keith



Re: Tests and software process

Andreas.Raab
In reply to this post by Hans-Martin Mosner
Hans-Martin Mosner wrote:
> Then there are a number of FloatMathPlugin tests which just test whether
> a sequence of floating point operations on a huge number of pseudorandom
> floats exactly yields a specified result. In one case, the result on my
> machine is not equal to the result specified in the test, but in more
> cases the pseudorandom inputs are simply not applicable to the
> mathematical functions under test. This indicates a problem with the
> test and not with the plugin.

I wrote those tests to make sure we have consistent (bit-identical)
results for various floating point functions across different Croquet
VMs. How these tests ended up in 3.9 I have no idea - they are part of
Croquet, for sure, and in the context of Croquet they make perfect sense
(and they pass if you use a Croquet VM and they fail if you don't -
which is exactly what they should do).

To me, it points more to a problem with the selection of code being put
into the base image than to any failing of the test itself. The test is
meaningful in the context it was designed for.

Cheers,
   - Andreas


Re: Tests and software process

Hans-Martin Mosner
Andreas Raab wrote:
> I wrote those tests to make sure we have consistent (bit-identical)
> results for various floating point functions across different Croquet
> VMs. How these tests ended up in 3.9 I have no idea - they are part of
> Croquet, for sure, and in the context of Croquet they make perfect
> sense (and they pass if you use a Croquet VM and they fail if you
> don't - which is exactly what they should do).
That's good. Just out of curiosity: Does the FloatMathPlugin used in
Croquet fail on any invalid inputs (e.g. numbers outside the range -1..1
for arcCos) or does it return NaN? I'd guess it returns NaN because
otherwise some of the tests could not possibly succeed.
>
> To me, it points out more a problem with the selection of code being
> put into the base image rather than any failing of the test itself.
> The test is meaningful in the context it was designed for.
Agreed. As far as I remember, the FloatMathPlugin for Croquet uses a
software implementation for some operations to achieve the goal of
bit-identical computation on all platforms. This probably means that the
functions are quite a bit slower, so including this in an environment
where the requirement is not present does not make much sense.
So removing these tests from the general Squeak image seems like the
reasonable thing to do, right?

BTW, I will try to run the tests with a Croquet VM soon, so then I will
know the answer to my first question :-)

Cheers,
Hans-Martin


Re: Tests and software process

Andreas.Raab
Hans-Martin Mosner wrote:

> Andreas Raab wrote:
>> I wrote those tests to make sure we have consistent (bit-identical)
>> results for various floating point functions across different Croquet
>> VMs. How these tests ended up in 3.9 I have no idea - they are part of
>> Croquet, for sure, and in the context of Croquet they make perfect
>> sense (and they pass if you use a Croquet VM and they fail if you
>> don't - which is exactly what they should do).
> That's good. Just out of curiosity: Does the FloatMathPlugin used in
> Croquet fail on any invalid inputs (e.g. numbers outside the range -1..1
> for arcCos) or does it return NaN? I'd guess it returns NaN because
> otherwise some of the tests could not possibly succeed.

Actually, this is also not the latest version of these tests. In Croquet we
use the CroquetVMTests suite which includes these and other tests. And
yes, the plugin fails (I added that when noticing that -thanks to
IEEE754- different platforms would report different bit-patterns for
NaN; all in compliance with the spec!) and the exception is handled by
simply resuming with NaN so that the test can successfully complete.
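
In image-side terms the idea is roughly this (a sketch, not the actual
CroquetVMTests code, which resumes the plugin's own failure signal):

| result |
result := [2.0 arcCos] on: Error do: [:ex | Float nan].    "2.0 is outside arcCos's domain"
Transcript show: result printString; cr.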

>> To me, it points out more a problem with the selection of code being
>> put into the base image rather than any failing of the test itself.
>> The test is meaningful in the context it was designed for.
> Agreed. As far as I remember, the FloatMathPlugin for Croquet uses a
> software implementation for some operations to achieve the goal of
> bit-identical computation on all platforms. This probably means that the
> functions are quite a bit slower, so including this in an environment
> where the requirement is not present does not make much sense.
> So removing these tests from the general Squeak image seems like the
> reasonable thing to do, right?

Yes. (Although it's not as slow as one may think, as long as you can use
a "native" sqrt instruction, which is fortunately the _one_ instruction
that the FPUs seem to agree upon.)

> BTW, I will try to run the tests with a Croquet VM soo, so then I will
> know the answer to my first question :-)

Run the CroquetVMTests instead. Those are really the relevant ones.

Cheers,
   - Andreas


Re: Tests and software process

Lex Spoon
In reply to this post by Ralph Johnson
"Ralph Johnson" <[hidden email]> writes:
> Squeak comes with a large set of SUnit tests.  Unfortunately, some of
> them don't work.  As far as I can tell, there is NO recent version of
> Squeak in which all the tests work.

Known-failing tests should be marked in some way or another.  Thus far
people have proposed putting them in separate packages so that you can
simply unload them.

That is not a bad solution.  However, it would be better if you could
load a known-failing test without having the tools bother you.  Then
you can see the tests and mess with them.  To achieve this, however,
you need a way to mark the test in the image.

The simplest way I can think of is to rename the methods for
known-failing tests.  Right now, a method named testFoo is a unit
test.  We could change that so that pendtestFoo is a pending unit test
that is known not to work.
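
A tool could then list the pending tests without running them; a quick
workspace sketch of that (using the 'pendtest' prefix proposed above):

| pending |
pending := OrderedCollection new.
TestCase allSubclasses do: [:testClass |
    (testClass selectors select: [:sel | sel beginsWith: 'pendtest'])
        do: [:sel | pending add: testClass name asString , '>>' , sel asString]].
Transcript show: pending printString; cr.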


-Lex





[pedantic] Re: Tests and software process

Howard Stearns
I apologize in advance for going off again. (See my earlier response to
the call for info/best-practices.) But I can't help but think about
"those who fail to learn from history..."  Here we are with all these
wonderful and practical ideas about doing just "one more thing" to make
stuff really work, and I'm unhelpfully being abstract. Sorry. And I
don't mean to discourage anyone. Just giving a heads-up.

What's fundamentally at issue with tests and with packages? At their
core, these are both declarative-style collections of things to be
operated on. The "win" is that the user doesn't need to manually carry
out the operations, and that the system can manage the combination of
collections and operations. (E.g., interleave complex combinations.)

The key point I'm trying to make is to recognize that we're dealing with
collections of stuff that can be combined, and with operations that can
be combined. As I understand them, SUnit and MC don't really handle
combinations of collections, nor do they interact with each other for
combinations of operations. (Monticello configurations are a step in
this direction, and before/after scripts in MC can be used to
procedurally achieve some manual combination of operations, but then
you're losing the power of the declarative definitions.)

So my feeling is that the various improvements to MC and SUnit are just
messing with the margins. Half a loaf. [This community will quite
rightly shout, "Go ahead and do it!" I'll answer in advance that I'm
working on other issues. This isn't on my critical path. Besides, my
meta-point is that this whole area is an already-solved problem. If I
had a student to throw at this...]

-H

[Another possible objection to the idea of combination is that what Lisp
did sounds like Mark Twain's comments on smoking. "It's easy to quit
smoking. I've done it hundreds of times!" The fact that this problem has
"been solved" so many times might mean that it hasn't. My personal view
is that it HAS been solved, but that there are other (social) issues
that have caused it to be repeatedly solved in the Lisp community.]


--
Howard Stearns
University of Wisconsin - Madison
Division of Information Technology
mailto:[hidden email]
jabber:[hidden email]
voice:+1-608-262-3724
