(Copying squeak-dev too).
I'm not sold on the whole test timeout thing. When I run tests, I want to know the answer to the question, "is the software working?"

Putting a timeout on tests trades a slower, but definitive, "yes" or "no" for a supposedly-faster "maybe". But is getting a "maybe" back really faster? I've just incurred the cost of running a test suite, but left without my answer. I get a "maybe"; what am I supposed to do next? Find a faster machine? Hack into the code to fiddle with a timeout pragma? That's not faster..

But the reason given for the change was not for running tests interactively (the 99% case); rather, all tests from the beginning of time are now saddled with a timeout for the 1% case:

"The purpose of the timeout is to catch issues like infinite loops, unexpected user input etc. in automated test environments."

If tests are supposed to be quick (and deterministic) anyway, wouldn't an infinite loop or user input be caught the first time the test was run (interactively)? Seriously, when we make software changes, we run the tests interactively first, and then the purpose of the night-time automated test environment is to catch regressions on the merged code..

In that case, the high-level test-controller which spits out the results could and should be responsible for handling "unexpected user input" and/or putting in a timeout, not each and every last test method..

IMO, we want short tests, so let's just write them to be short. If they're too long, then the encouragement to shorten them comes from our own impatience at running them interactively. Running them in batch at night requires no patience, because we're sleeping, and besides, the batch processor should take responsibility for handling those rare scenarios at a higher level..

Regards,
Chris

On Sat, May 29, 2010 at 2:53 AM, stephane ducasse <[hidden email]> wrote:
> Hi guys
>
> in Squeak andreas introduced the idea of test time out
> Do you think that this is interesting?
>
> Stef
>
> SUnit
> -----
> All test cases now have an associated timeout after which the test is considered failed. The purpose of the timeout is to catch issues like infinite loops, unexpected user input etc. in automated test environments. Timeouts can be set on an individual test basis using the <timeout: seconds> tag or for an entire test case by implementing the #defaultTimeout method.
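For reference, the two mechanisms described in the quoted announcement would look roughly like this in practice (the test-case class, the test method, and the timeout values here are purely illustrative):

    MyPackageTest>>testLargeIndexRebuild
        "Give this one known-slow test up to 30 seconds before it is failed."
        <timeout: 30>
        self assert: self buildLargeIndex notNil

    MyPackageTest>>defaultTimeout
        "Answer the timeout, in seconds, applied to every test in this case."
        ^60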
Well put.
Sent from my iPhone

On May 30, 2010, at 11:52, Chris Muller <[hidden email]> wrote:
> [...]
Hi Chris -
Let me comment on this from a more general point of view first, before going into the specifics.

I've spent the last five years building a distributed system and during this time I've learned a couple of things about the value of timeouts :-) One thing that I've come to understand is that *no* operation is unbounded. We may leisurely talk about "just wait until it's done" but the reality is that, regardless of what the operation is, we never actually wait forever. At some point we *will* give up, no matter what you may think. This is THE fundamental point here. Everything else is basically haggling about what the right timeout is.

As for the right timeout, the second fundamental thing to understand is that if there's a question of whether the operation "maybe" completed, then your timeout is too short. Period. The timeout's value is not to indicate that "maybe" the operation completed; it is there to say unequivocally that something caused it to not complete and that it DID fail.

Obviously, introducing timeouts will create some initial false positives. But it may be interesting to be a bit more precise about what we're talking about. To do this I instrumented TestRunner to measure the time it takes to run each test and then ran all the tests in 4.2 to see where that leads us. As you might expect, the distribution is extremely uneven. Out of 2681 tests run, 2588 execute in < 500 msecs (approx. 1800 execute with no measurable time); 2630 execute in less than one second, leaving a total of 51 that take more than a second, and only three tests actually take longer than 5 seconds, and they are all tagged as such.

As you can see, the vast majority of tests have a "safety margin" of 10x or more between the time the test usually takes and its timeout value. Generally speaking, this margin is sufficient to compensate for "other" effects that might rightfully delay the completion of the test. If you have tests that commonly vary by 10x I'd be interested in finding out more about what makes them so unpredictable.

So if your question is "are my timeouts too tight", one thing we could do is to introduce the 10x as a more or less general guideline for executing tests, and perhaps add a Transcript notifier if we ever come closer than 1/3rd of the specified timeout value (i.e., indicating that something in the nature of the test has changed that should be reflected in its timeout). This would give you ample warning that you need to adjust your test even if it isn't (yet) failing on the timeout.

That said, a couple of concrete comments to your post:

On 5/30/2010 11:52 AM, Chris Muller wrote:
> (Copying squeak-dev too).
>
> I'm not sold on the whole test timeout thing. When I run tests, I
> want to know the answer to the question, "is the software working?"

Correct.

> Putting a timeout on tests trades a slower, but definitive, "yes" or
> "no" for a supposedly-faster "maybe". But is getting a "maybe" back
> really faster? I've just incurred the cost of running a test suite,
> but left without my answer. I get a "maybe"; what am I supposed to do
> next? Find a faster machine? Hack into the code to fiddle with a
> timeout pragma? That's not faster..

See above. If you're thinking "maybe", then the timeout is too short.

> But the reason given for the change was not for running tests
> interactively (the 99% case); rather, all tests from the beginning of
> time are now saddled with a timeout for the 1% case:

As the data shows, this is already the case.
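For concreteness, the per-test measurement Andreas describes could be reproduced with something roughly like the following workspace snippet. This is a sketch, not his actual TestRunner change; it assumes only standard Squeak SUnit selectors:

    | timings |
    timings := OrderedCollection new.
    TestCase allSubclasses do: [:testClass |
        testClass isAbstract ifFalse: [
            (testClass selectors select: [:sel | sel beginsWith: 'test']) do: [:sel |
                | ms |
                "Run one test in isolation and record how long it took."
                ms := Time millisecondsToRun: [(testClass selector: sel) run].
                timings add: (testClass name -> sel) -> ms]]].
    "Slowest first, to spot the handful of outliers."
    timings asSortedCollection: [:a :b | a value > b value]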
It may be interesting to note that so far there were a total of 5 (five) places that had to be adjusted in Squeak. One was a general place (the default timeout for the decompiler tests) and four were individual methods. Considering that computers usually don't become slower over time, it seems unlikely that further adjustments will be necessary here. So the bottom line is that the changes required aren't exactly excessive.

> "The purpose of the timeout is to catch issues like infinite loops,
> unexpected user input etc. in automated test environments."
>
> If tests are supposed to be quick (and deterministic) anyway, wouldn't
> an infinite loop or user input be caught the first time the test was
> run (interactively)? Seriously, when we make software changes, we
> run the tests interactively first, and then the purpose of the night-time
> automated test environment is to catch regressions on the merged
> code.

These changes are largely intended for automated integration testing. I am hoping to automate the tests for community supported packages to a point where there will be no user in front of the system. Even if there were, it's not clear whether that person can fix the issue immediately or whether the entire process is stuck because someone can momentarily not fix the problem at hand and the tests will never run to completion and produce any useful result.

So the idea here is not that unit tests are *only* to catch regressions in previously manually tested (combinations of) code. The idea is to catch interaction and integration bugs, and to be able to produce a result even if there is no user to watch the particular combination of packages being loaded together in this particular form.

Perhaps that is our problem here? It seems to me that you're taking a view that says unit tests are exclusively for regression testing, and consequently there is no way a previously successful test would suddenly become unsuccessful in a way that makes it time out ... but you know, having written this sentence, it makes no sense to me. If we knew beforehand that tests fail only in particular known ways, we wouldn't have to run them to begin with. The whole idea of running the tests is to catch *unexpected* situations, and as a consequence there is value in capturing these situations instead of hanging and producing no useful result.

> In that case, the high-level test-controller which spits out the
> results could and should be responsible for handling "unexpected user
> input" and/or putting in a timeout, not each and every last test
> method..

Do you have such a "high-level test-controller"? Or do you mean a human being spending their time watching the tests run to completion? If the former, I'm curious as to how it would differ from what I did. If the latter, are you volunteering? ;-)

> IMO, we want short tests, so let's just write them to be short. If
> they're too long, then the encouragement to shorten them comes from
> our own impatience at running them interactively. Running them in
> batch at night requires no patience, because we're sleeping, and
> besides, the batch processor should take responsibility for handling
> those rare scenarios at a higher level..

The goal of the timeouts is *not* to cause you to write shorter tests. If you're looking at it this way you're looking at it from the wrong angle. Up your timeout to whatever you feel is sensible to have trust in the results of the tests.
As I said earlier, I'm quite happy to discuss the default timeout; it's simply that with some 95% coverage on a 10x safety margin, it feels to me that we're playing it safe enough for the remaining cases to have explicit timeouts.

Cheers,
  - Andreas
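The Transcript notifier Andreas floats above might look something like this hedged sketch. It assumes a #timeoutForTest selector answering the effective timeout in seconds, which may not match the real implementation:

    MyPackageTest>>runCaseReportingSlowness
        "Hypothetical wrapper: run the test, then warn on the Transcript if it
        consumed more than a third of its timeout budget."
        | ms budget |
        ms := Time millisecondsToRun: [self runCase].
        budget := self timeoutForTest * 1000.
        ms * 3 > budget ifTrue:
            [Transcript show: self printString, ' used ', ms printString,
                ' ms of its ', budget printString, ' ms budget'; cr]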
Thanks for clarifying your goals w.r.t. introducing the timeout. I think that's important because, as I've said, legacy tests that live in external packages are affected.

I read your whole note a few times, and one part in particular stuck out to me as a potentially useful use-case for test-case timeout:

> These changes are largely intended for automated integration testing. I am
> hoping to automate the tests for community supported packages to a point
> where there will be no user in front of the system.

If, by this, you mean you want to simply have a headless running Squeak image which:

    [ true ] whileTrue:
        [ loadLatestPackageCombinations.
          runTestSuite.
          mailResultsToSqueakDev ]

THEN, that brings us down to only haggling about the default timeout, although I still would prefer to handle the timeout at a higher level..

If, however, this isn't the goal, then I still don't seem to have grasped, what I sense is, some key point.. or that my own concerns were properly understood. If so, let me try one more time. :)

> done" but the reality is that regardless of what the operation is we never
> actually wait forever. At some point we *will* give up no matter what you
> may think. This is THE fundamental point here. Everything else is basically
> haggling about what the right timeout is.

Of course we would "give up" after an unreasonable amount of time. In either case, there is something to interrogate: either a live looping test-runner machine, or a static report of test results with one or more that say "timed out".

In the former case, we have a bevy of useful information (e.g., which test is it trying to run? How much memory is the test image using right now? Can I Alt+. interrupt it and get even more information?)

In the latter case, there is no choice but to start at square 1: try to recreate the problem. (What if it works?)

Personally, I would always prefer to deal with the former case than the latter..

> As for the right timeout, the second fundamental thing to understand is that if
> there's a question of whether the operation "maybe" completed, then your
> timeout is too short. Period. The timeout's value is not to indicate that
> "maybe" the operation completed; it is there to say unequivocally that
> something caused it to not complete and that it DID fail.

I didn't understand this. There is no question about "maybe completed". We know if a test times out then it _didn't_ complete. The "maybe" I referred to was about the core question: whether the underlying software being tested can be used or not. "Maybe" it could; then again, maybe it shouldn't. It sounds like we agree: a timeout would *have* to be regarded as a failure.

> Obviously, introducing timeouts will create some initial false positives.

You mean false negatives? If we are saying that we must treat a timeout as failure, and failure is "negative", then a timeout would be a false negative or a true negative....?

> But it may be interesting to be a bit more precise about what we're talking
> about. To do this I instrumented TestRunner to measure the time it takes to
> run each test and then ran all the tests in 4.2 to see where that leads us.
> As you might expect, the distribution is extremely uneven. Out of 2681 tests
> run, 2588 execute in < 500 msecs (approx. 1800 execute with no measurable
> time); 2630 execute in less than one second, leaving a total of 51 that
> take more than a second, and only three tests actually take longer than 5
> seconds, and they are all tagged as such.

That's fine for the 4.2 tests, but there are hundreds of tests in external packages.
With a mere 5-second default, many will need to be updated with a pragma. But then we're talking about a branch in the package, because that won't be backward compatible with 3.9, will it?

> As you can see, the vast majority of tests have a "safety margin" of 10x or
> more between the time the test usually takes and its timeout value.
> Generally speaking, this margin is sufficient to compensate for "other"
> effects that might rightfully delay the completion of the test.

I can see that jacking up the timeout may tend to reduce the number of false negatives (at the expense of potentially longer wait times!), but when they do occur, we have no useful information whatsoever. Not even certainty about whether the underlying software is usable or not, because it could be a false negative.

> If
> you have tests that commonly vary by 10x I'd be interested in finding out
> more about what makes them so unpredictable.

Well, again, it's not just about randomness in the tests but also about external factors: CPU speed, current system load, etc.

> So if your question is "are my timeouts too tight", one thing we could do is
> to introduce the 10x as a more or less general guideline for executing
> tests,

Ok, with that kind of margin, the message I'm getting from you is that it isn't about making a human have to wait. We just want to make sure we "get some kind of report"?

>> But the reason given for the change was not for running tests
>> interactively (the 99% case); rather, all tests from the beginning of
>> time are now saddled with a timeout for the 1% case:
>
> As the data shows, this is already the case. It may be interesting to note
> that so far there were a total of 5 (five) places that had to be adjusted in
> Squeak.

I'm not worried about the built-in tests; recall I acknowledged that I can "almost understand" a forced timeout in the context of an open-source project where people are all contributing their portions and no one else wants to be "held up" because of one person's tests looping.

My concern is more about the impact on legacy external packages..

> One was a general place (the default timeout for the decompiler
> tests) and four were individual methods. Considering that computers usually
> don't become slower over time, it seems unlikely that further adjustments
> will be necessary here.

Well, they do.. It's not just a function of time, but of who's running it, and on which machine. We all have different machines. Maybe someone wants to test on an iPhone that might be considerably slower than the original desktop on which the timeout was specified...

> So the bottom line is that the changes required
> aren't exactly excessive.

That depends on how many test methods I have in order for a Community Supported Package to be included, and whether I also want that to run in 3.9, and whether, to do that, I have to put in a pragma.. (unless I'm mistaken about pragmas working in 3.9).

Bottom line: Today Magma runs on 3.9 - 4.2 + Pharo. Some of Magma's tests necessarily take several minutes.

Question: Can Magma be a CSP and still retain this wide compatibility?

> These changes are largely intended for automated integration testing. I am
> hoping to automate the tests for community supported packages to a point
> where there will be no user in front of the system.
> Even if there were, it's
> not clear whether that person can fix the issue immediately or whether the
> entire process is stuck because someone can momentarily not fix the problem
> at hand and the tests will never run to completion and produce any useful
> result.

Who is "that person" and what is their role?

> begin with. The whole idea of running the tests is to catch *unexpected*
> situations, and as a consequence there is value in capturing these situations
> instead of hanging and producing no useful result.

To me, "timed out" is what is not useful. To find a hanging machine that can be interrogated is much more useful.

>> In that case, the high-level test-controller which spits out the
>> results could and should be responsible for handling "unexpected user
>> input" and/or putting in a timeout, not each and every last test
>> method..
>
> Do you have such a "high-level test-controller"? Or do you mean a human
> being spending their time watching the tests run to completion? If the
> former, I'm curious as to how it would differ from what I did. If the
> latter, are you volunteering? ;-)

I meant the former. It differs from what you did in that it preserves legacy compatibility, and the legacy deterministic property of testing. To handle an automated test server, I would handle the on-timeout: from a much higher place, and therefore it would not be for individual tests, but for the whole suite. Information about the last running test would be sufficient for me, especially if we're talking about all of the other disadvantages I've mentioned for fine-grained timeouts..

- Chris
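A minimal sketch of the suite-level alternative Chris describes, assuming BlockClosure>>valueWithin:onTimeout: is available (the test-case class name and the overall two-hour budget are illustrative):

    | suite lastTest |
    lastTest := nil.
    suite := MyPackageTest buildSuiteFromSelectors.  "class name illustrative"
    [suite tests inject: TestResult new into: [:result :each |
        lastTest := each.  "remember the most recently started test"
        each run: result.
        result]]
            valueWithin: 2 hours
            onTimeout:
                [Transcript show: 'Suite timed out; last running test: ',
                    lastTest printString; cr]

This keeps individual tests deterministic while still guaranteeing that a batch run terminates and reports at least the name of the test that hung.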
Usually in a test, "false positive" is when the test thinks it found a bug, but there's actually something wrong with the test. "False negative" usually means that a test erroneously passed when it shouldn't have. Of course I am probably speaking a regional dialect which may be somewhat rooted in Seattle, WA test culture:)
On Jun 2, 2010, at 5:09 PM, Chris Muller <[hidden email]> wrote:
> [...]
I completely agree
On Sun, May 30, 2010 at 2:52 PM, Chris Muller <[hidden email]> wrote:
> (Copying squeak-dev too).
> [...]
On Tue, Jun 01, 2010 at 09:36:48PM -0700, Andreas Raab wrote:
> These changes are largely intended for automated integration testing. I
> am hoping to automate the tests for community supported packages to a
> point where there will be no user in front of the system.

I've run into one issue for externally supported packages that need to work on older images. The <timeout: 30> method annotation works very well, but is not supported on all images. I put SUnit-dtl.79 in the inbox as a possible solution.

Dave
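One plausible shape for such a compatibility layer (a guess at the approach, not necessarily what SUnit-dtl.79 actually contains) is a pragma-aware timeout lookup added to older images, so annotated tests load and run there:

    TestCase>>timeoutForTest
        "Answer this test's timeout in seconds: the <timeout:> pragma on the
        test method if present, otherwise the case-wide default. A sketch;
        the lookup in newer images may differ in detail."
        | method pragma |
        method := self class lookupSelector: self selector.
        pragma := method pragmas
            detect: [:p | p keyword = #timeout:]
            ifNone: [nil].
        ^pragma isNil
            ifTrue: [self defaultTimeout]
            ifFalse: [pragma arguments first]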