On 2011-04-05, at 15:02, Nicolas Cellier wrote:

> Better than average, take the median

NO! That's why you provide the standard deviation...

You should not optimize the results prematurely but provide numbers that let you decide on the quality of the benchmarks. What you can do is correct for systematic errors, like warm-up and the like, but nevertheless you should provide the raw results...

Read the paper...

> Nicolas
>
> 2011/4/5 Camillo Bruni <[hidden email]>:
>> This is exactly why you have to provide some confidence interval / deviation, otherwise it is hard to make any reasonable conclusion.
>>
>> run it 100 times and take the average and provide the standard deviation.
>>
>> I am not a big fan of relying on incomplete benchmarking results:
>>
>> Please read: http://portal.acm.org/citation.cfm?id=1297033
>>
>> http://www.squeaksource.com/p.html provides a basic benchmarking framework under the NBenchmark package. You subclass PBenchmarkSuite, implement a method #benchXXX and run it:
>>
>> r := PBFloat run: 100.
>> r asString
>>
>> which will give decent results back :). This way it is much easier to make sense of the numbers.
>>
>> So here again, to remember:
>>
>> - number of samples
>> - average run times
>> - standard deviation
>>
>> If one of these results is missing, the benchmark results are incomplete.
>>
>> best regards,
>> camillo
>>
>> On 2011-04-05, at 13:56, Igor Stasenko wrote:
>>
>>> VariableNode initialize.
>>> Compiler recompileAll.
>>>
>>> [
>>>   TestCase allSubclasses do: [ :cls |
>>>     cls isAbstract
>>>       ifFalse: [ cls suite run ] ].
>>> ] timeToRun
>>>
>>> 178938
>>> 183963
>>>
>>> (ParseNode classVarNamed: 'StdSelectors') removeKey: #class ifAbsent: [].
>>> Compiler recompileAll.
>>>
>>> [
>>>   TestCase allSubclasses do: [ :cls |
>>>     cls isAbstract
>>>       ifFalse: [ cls suite run ] ].
>>> ] timeToRun
>>>
>>> 187168
>>> 184992
>>>
>>> the deviation is too big to see if it's really such a big overhead.
>>>
>>> if you compare the worst, you get 187/178 ~ 5%,
>>> and if you compare the best, you get 184/183 ~ 0.5%.
>>>
>>> --
>>> Best regards,
>>> Igor Stasenko AKA sig.
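For reference, collecting numbers in that shape does not strictly need a framework. A minimal workspace sketch along these lines would do; this is hypothetical code, not the NBenchmark API, and the run count of 10 and the test-suite workload are simply taken from this thread:

"Minimal sketch: n raw timings of the macro benchmark from this thread,
 plus sample count, mean and standard deviation."
| n times mean stdDev |
n := 10.
times := (1 to: n) collect: [ :i |
	[ TestCase allSubclasses do: [ :cls |
		cls isAbstract ifFalse: [ cls suite run ] ] ] timeToRun ].
mean := (times inject: 0 into: [ :a :b | a + b ]) / n.
stdDev := ((times inject: 0 into: [ :a :b | a + ((b - mean) squared) ]) / (n - 1)) sqrt.
{ n. times. mean asFloat. stdDev }

Inspecting the resulting array gives exactly the three items on the checklist above, together with the raw values.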
In reply to this post by Camillo Bruni
I don't care :)
What I found is that the measurements confirm my previous expectation: the 'slowdown' of removing #class from the set of optimized selectors lies within the range of macro-benchmark deviation.

And running all tests 100 times is not fun (it could take half a day), and I'm not ready to leave my computer alone for half a day and simply wait until it finishes, because if I do something else in the meantime it will affect the performance, and the measurements will be even more inaccurate.

On 5 April 2011 14:53, Camillo Bruni <[hidden email]> wrote:
> This is exactly why you have to provide some confidence interval / deviation, otherwise it is hard to make any reasonable conclusion.
>
> run it 100 times and take the average and provide the standard deviation.
> [...]

--
Best regards,
Igor Stasenko AKA sig.
On 2011-04-05, at 15:11, Igor Stasenko wrote:

> I don't care :)
>
> And running all tests 100 times is not fun (it could take half a day) [...]

That's why you run it overnight :) (or during the day, depending on your sleep rhythm). Or you reduce the number of runs to something at least bigger than 1 :D.
In reply to this post by Igor Stasenko
On 05 Apr 2011, at 15:11, Igor Stasenko wrote:

> I don't care :)

*sigh* Then just flip a coin and you have the same amount of insight with less effort.

It is always astonishing how few people know about the basics when it comes to empirical experiments.

I disagree with Camillo's demand for 100x repetition, but it should definitely be run at least a few times. And then you have to look at the results and see how they vary. Otherwise you don't know anything and are just wasting your own time.

I will try at the next sprint to make some progress on the framework front. Perhaps someone could throw together something like the unit-test runner to make at least the basics fool-proof.

And at the same time the message to measure the execution time of a block should become deprecated or raise a warning that you are about to fool yourself...

It does not have to be statistically rigorous for small experiments, but at the very least you have to convince yourself that your measurements are actually of any value. The computer in front of you is just too complex to make an educated guess.

Best regards
Stefan

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax: +32 2 629 3525
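One very small step in that direction could look like the following sketch: a hypothetical extension method (no such selector exists in the image; it is shown only to illustrate answering raw samples instead of a single, easily misleading timing):

"Sketch of a sampling replacement for single-shot block timing.
 #timeToRunSampled: is a made-up selector."
BlockClosure
	compile: 'timeToRunSampled: n
	"Answer the n raw millisecond timings of running the receiver,
	 instead of one easily misleading measurement."
	^ (1 to: n) collect: [ :i | self timeToRun ]'
	classified: 'benchmarking'.

"usage:"
[ 1000 factorial ] timeToRunSampled: 5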
In reply to this post by Igor Stasenko
Running all the unit tests is probably a better benchmark.
Alexandre

On 5 Apr 2011, at 07:56, Igor Stasenko wrote:

> [
>   TestCase allSubclasses do: [ :cls |
>     cls isAbstract
>       ifFalse: [ cls suite run ] ].
> ] timeToRun
> [...]
> the deviation is too big to see if it's really such a big overhead.

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
In reply to this post by Stefan Marr-4
On 5 April 2011 16:31, Stefan Marr <[hidden email]> wrote:
>
> On 05 Apr 2011, at 15:11, Igor Stasenko wrote:
>
>> I don't care :)
>
> *sigh* Then just flip a coin and you have the same amount of insight with less effort.
>
> It is always astonishing how few people know about the basics when it comes to empirical experiments.
>
> I disagree with Camillo's demand for 100x repetition, but it should definitely be run at least a few times. And then you have to look at the results and see how they vary. Otherwise you don't know anything and are just wasting your own time.

That's what I did: I ran it two times and found that the slowdown is at the scale of the deviation. This is why I don't care about the exact slowdown anymore; it is at the level of noise on macro benchmarks.

> I will try at the next sprint to make some progress on the framework front.
> Perhaps someone could throw together something like the unit-test runner to make at least the basics fool-proof.
> [...]

--
Best regards,
Igor Stasenko AKA sig.
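In workspace terms, the two-run comparison Igor refers to comes down to something like this (the timings are copied from his first mail; the expressions just reproduce the worst-case and best-case ratios he already quoted):

| before after |
before := #(178938 183963).   "with #class among the optimized selectors"
after := #(187168 184992).    "with #class removed"
{ (after max / before min) asFloat.   "pessimistic ratio, ~1.046, about 5%"
  (after min / before max) asFloat }  "optimistic ratio, ~1.006, about 0.5%"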
Depends on the signal/noise ratio at which you are ready to drop the signal...
Nicolas

2011/4/5 Igor Stasenko <[hidden email]>:
> That's what I did: I ran it two times and found that the slowdown is at the scale of the deviation. This is why I don't care about the exact slowdown anymore; it is at the level of noise on macro benchmarks.
> [...]
In reply to this post by Camillo Bruni
Sorry, no time to read it right now, so maybe I'd better have shut up...
Nonetheless, it very much depends on the services provided by the OS. If we only access the wall clock, then there will be noise due to other OS processes...

... but this noise is unlikely to follow a Gaussian with 0 mean, is it? In this case, it makes sense to filter out the outliers, and I would say this simple, stupid method could possibly work:

- take the median
- and the standard deviation after filtering out a fixed percentage of outliers

But you probably have a more elaborate and well-thought-out method :)

Nicolas

2011/4/5 Camillo Bruni <[hidden email]>:
>
> On 2011-04-05, at 15:02, Nicolas Cellier wrote:
>
>> Better than average, take the median
>
> NO! That's why you provide the standard deviation...
>
> You should not optimize the results prematurely but provide numbers that let you decide on the quality of the benchmarks. What you can do is correct for systematic errors, like warm-up and the like, but nevertheless you should provide the raw results...
>
> Read the paper...
> [...]
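A crude workspace rendering of that median / trimmed-deviation idea might look like this (the sample timings below are invented for illustration, and the 10% trim is an arbitrary choice):

"Sketch: sort the samples, take the median, drop a fixed percentage of
 outliers at both ends, then compute the standard deviation of the rest."
| times sorted median trim trimmed mean stdDev |
times := #(178938 183963 187168 184992 179500 191200 183100 185750).
sorted := times asSortedCollection asArray.
median := sorted at: sorted size + 1 // 2.   "lower median for even-sized samples"
trim := (sorted size / 10) rounded.          "~10% of the samples at each end"
trimmed := sorted copyFrom: trim + 1 to: sorted size - trim.
mean := (trimmed inject: 0 into: [ :a :b | a + b ]) / trimmed size.
stdDev := ((trimmed inject: 0 into: [ :a :b | a + ((b - mean) squared) ])
		/ (trimmed size - 1)) sqrt.
{ median. mean asFloat. stdDev }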
Hi:
On 05 Apr 2011, at 21:27, Nicolas Cellier wrote:

> But you probably have a more elaborate and well-thought-out method :)

I think we are talking here mostly about quick and dirty benchmarks to confirm intuitions.

The provably best and most cost-effective way to do it is: flip a coin (and that is serious, it has the same or even better reliability and quality of result).

If you think that is not good enough, well, then the absolute minimum is something like this:

- run your benchmark 5 times
- 5 times with your optimization
- 5 times without
- look at the collected raw values, i.e. the measured runtimes
- and use your gut feeling to decide whether you have meaningful values and whether your time spent on the optimization was well spent (don't eliminate outliers)

If that is still too much to do, well, again, just flip a coin, and there is a high chance that the result of that is actually more reliable. (The approach above does not account for systematic errors...)

However, once you get your optimization into the VM, your favorite build server should do its thing, and that should include performance regression testing, which uses a "slightly" more sophisticated approach than the one described.

Best regards
Stefan

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax: +32 2 629 3525
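A bare-bones rendering of that protocol in workspace code, reusing the recompilation snippets from Igor's first mail, could look like this (a sketch only; it measures wall-clock time and accounts for none of the systematic errors mentioned above):

"Sketch: 5 runs without the change, 5 runs with it, raw values only."
| runBench withoutChange withChange |
runBench := [ (1 to: 5) collect: [ :i |
	[ TestCase allSubclasses do: [ :cls |
		cls isAbstract ifFalse: [ cls suite run ] ] ] timeToRun ] ].

"baseline: #class among the optimized selectors"
VariableNode initialize.
Compiler recompileAll.
withoutChange := runBench value.

"the change under test: #class removed from the optimized selectors"
(ParseNode classVarNamed: 'StdSelectors') removeKey: #class ifAbsent: [].
Compiler recompileAll.
withChange := runBench value.

{ withoutChange. withChange }   "eyeball the spread; don't eliminate outliers"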
Hi,
On 5 April 2011 23:26, Stefan Marr <[hidden email]> wrote:

> However, once you get your optimization into the VM, your favorite build server should do its thing, and that should include performance regression testing, which uses a "slightly" more sophisticated approach than the one described.

:-) This one is "ever so slightly" more sophisticated: "Wake up and smell the coffee: evaluation methodology for the 21st century" by Blackburn et al. - PDF available from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1635&rep=rep1&type=pdf

Best,
Michael
In reply to this post by Camillo Bruni
On 05 Apr 2011, at 14:53, Camillo Bruni wrote:

> Please read: http://portal.acm.org/citation.cfm?id=1297033

I would love to read this, but why should I have to pay for it? Research, especially when it is paid for with public money, should be free for all.

Sven
On 06 Apr 2011, at 00:01, Sven Van Caekenberghe wrote:

> On 05 Apr 2011, at 14:53, Camillo Bruni wrote:
>
>> Please read: http://portal.acm.org/citation.cfm?id=1297033
>
> I would love to read this, but why should I have to pay for it?
> Research, especially when it is paid for with public money, should be free for all.

http://lmgtfy.com/?q=Statistically+rigorous+java+performance+evaluation
...

> Sven

In case Google is only my friend:
http://itkovian.net/base/files/papers/oopsla2007-georges-preprint.pdf

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax: +32 2 629 3525