Hi there,
After more than 2 years of (time-to-time) development and about that much time of use, I'd like to announce CalipeL, a tool for benchmarking and monitoring performance regressions.

The basic ideas that drove the development:

* Benchmarking and (especially) interpreting benchmark results is always a bit of monkey business. The tool should produce raw numbers, letting the user apply whichever statistics she needs to make up the (desired) results.
* Benchmark results should be kept and managed in a single place, so one can view and retrieve all past benchmark results pretty much the same way as one can view and retrieve past versions of the software from a source code management tool.

Features:

- simple - creating a benchmark is as simple as writing a method in a class
- flexible - special set-up and/or warm-up routines can be specified at benchmark level, as well as a set of parameters to allow fine-grained measurements under different conditions
- batch runner - contains a batch runner allowing one to run benchmarks from the command line or on CI servers such as Jenkins
- web - comes with a simple web interface to gather and process benchmark results. However, the web application deserves some more work.

Repository:

https://bitbucket.org/janvrany/jv-calipel

http://smalltalkhub.com/#!/~JanVrany/CalipeL-S (read-only export from the above, plus Pharo-specific code)

More information:

https://bitbucket.org/janvrany/jv-calipel/wiki/Home

I have been using CalipeL for benchmarking and keeping track of the performance of the Smalltalk/X VM, STX:LIBJAVA, a PetitParser compiler and other code I have been working on over time.

Finally, I'd like to thank Marcel Hlopko for his work on the web application and Jan Kurs for his comments.

I hope some of you may find it useful. If you have any comments or questions, do not hesitate to let me know!

Regards, Jan
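P.S. To give a flavour of the "a benchmark is just a method" point, here is a rough sketch. Only the <benchmark> annotation is CalipeL's own; the benchmark body is an arbitrary example, and the wiki has the full details.

    benchmarkOrderedCollectionAdd
        "A benchmark is an ordinary method annotated with <benchmark>;
         the method body is what gets measured."
        <benchmark>

        | coll |
        coll := OrderedCollection new.
        1 to: 100000 do: [:i | coll add: i]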
Hi Jan,
That looks pretty cool! We use SMark (http://smalltalkhub.com/#!/~PharoExtras/SMark) for benchmarking and CI integration for Fuel. If you know SMark, could you give me an idea of what the differences are?

Cheers,
Max
Hi Max,
I looked at some version of SMark years ago and never used it extensively, so I might be wrong, but:

* The SMark executor does some magic with numbers. It tries to calculate the number of iterations to run in order to get "statistically meaningful results". Maybe it's just me, but I could not fully understand what it does and why it does it that way. CalipeL does no magic - it gives you raw numbers (no average, no mean, just a sequence of measurements). It's up to whoever processes and interprets the data to use whatever method she likes (and whichever gives the numbers she'd like to see :-) This transparency was important for our needs.

* SMark, IIRC, requires benchmarks to inherit from some base class (like SUnit). Also, I'm not sure whether SMark allows you to specify a warm-up phase (handy, for example, to measure peak performance once caches are filled). CalipeL, OTOH, uses method annotations to describe the benchmark, so one can turn a regular SUnit test method into a benchmark simply by annotating it with <benchmark>. A warm-up method and set-up/tear-down methods can be specified per benchmark.

* SMark has no support for parametrization. In CalipeL, support for benchmark parameters was one of the requirements from the very beginning. A little example: I had to optimize the performance of the Object>>perform: family of methods because they were thought to be slow. I came up with several variants of a "better" implementation, not knowing which one was best. How does each of them behave under different workloads? For instance, how does the number of distinct receiver classes affect performance? How does the number of distinct selectors affect it? Is performance different when receiver classes are distributed uniformly rather than normally (which seems to be the more common case)? The same for selectors? Is a 256-row, 2-way associative cache better than a 128-row, 4-way associative one? You have a number of parameters, for each parameter you define a number of values, and CalipeL works out all possible combinations and runs the benchmark with each of them (see the P.S. below for a little sketch). Without parametrization, the number of benchmark methods would grow exponentially, making it hard to experiment with different setups. For me, this is one of the key things.

* SMark measures time only. CalipeL measures time too, but it also lets you provide a user-defined "measurement instrument", which can measure anything that can be measured. For example, for some web application the execution time might not be that useful; perhaps the number of SQL queries it makes is more important. No problem - define your own measurement instrument and tell CalipeL to use it in addition to time, number of GCs, you name it. The results of all instruments are part of the machine-readable report, of course.

* SMark has no support for "system" profilers and the like. CalipeL integrates with systemtap/dtrace and cachegrind, so one can get a full profile, including VM code, and see things like L1/L2 I/D cache misses and mispredicted branches, or count events like context switches, monitor signalling and context evacuation. Useful only for VM engineers, I think, but I cannot imagine doing my work without this. It is available only for Smalltalk/X, but it should not be a big deal to add this to Pharo (a simple plugin would do it, IMO).

* Finally, SMark spits out a report and that's it. CalipeL, OTOH, goes beyond that. It tries to provide tools to gather, store and query results in a centralised way so nothing is forgotten. (No more: hmm, where are the results of the #perform: benchmarks I ran three months ago? Is it this file? Or that file? Or did I delete them when my laptop ran out of disk space?) And yes, I know that in this area there's a lot of room for improvement. What we have now is certainly not ideal, to put it mildly :-)

Hope that gives you the idea.

Jan
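P.S. To make the parametrization point a bit more concrete, a parametrized benchmark could look roughly like the sketch below. Note that the <parameter:values:> pragma and the #parameterAt: accessor are only my shorthand for the idea, not CalipeL's actual syntax - please check the wiki for the real one.

    benchmarkDictionaryAtPut
        "Illustration only: the <parameter:values:> pragma and the
         #parameterAt: accessor are made-up names standing in for CalipeL's
         parameter support. With, say, a second pragma carrying two values,
         CalipeL would run all 3 x 2 = 6 combinations."
        <benchmark>
        <parameter: #size values: #(1000 10000 100000)>

        | dict |
        dict := Dictionary new.
        1 to: (self parameterAt: #size) do: [:i | dict at: i put: i]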
Thanks Jan! That was quite thorough. I'll have to take a look at CalipeL sometime. Sure sounds great :)

Cheers,
Max
Hi Jan,
Hi Max,

I guess the main issue is missing documentation… even so, there are class comments…

> On 01 Nov 2015, at 23:45, Jan Vrany <[hidden email]> wrote:
>
> I looked at some version of SMark years ago and never used it
> extensively, so I might be wrong, but:
>
> * SMark executor does some magic with numbers.

Nope. It only does that if you ask for it. Granted, though, that is the default setting, because it is supposed to be convenient to use from within the image.

The SMark design knows the concepts of reporter (how and what data to report), runner (how to execute benchmarks), suite (the benchmarks) and timer (which should really be named gauge or something; it can be anything and doesn't have to measure time).

> It tries to calculate a number of iterations to run in order to get
> "statistically meaningful results". [...]
> CalipeL does no magic - it gives you raw numbers (no average, no mean,
> rather a sequence of measurements).

See the ReBenchHarness; it gives you exactly that as an alternative default setting.

> * SMark, IIRC, requires benchmarks to inherit from some base class
> (like SUnit).

Require is a strong word; as long as you implement the interface of SMarkSuite, you can inherit from wherever you want. It's Smalltalk after all.

> Also, not sure if SMark allows you to specify a warmup phase (handy
> for example to measure peak performance when caches are filled or so).

There is the concept of #setup/#teardown methods. And a runner can do whatever it wants/needs to reach warm-up, too. For instance, the SMarkCogRunner will make sure that all code is compiled before starting to measure.

> CalipeL, OTOH, uses method annotations to describe the benchmark, so one
> can turn a regular SUnit test method into a benchmark simply by
> annotating it with <benchmark>.

Ok, that's not possible.

> A warmup method and setup/teardown methods can be specified per benchmark.

We've got that too.

> * SMark has no support for parametrization.

Well, there is the #problemSize parameter, but that is indeed rather simplistic.

> * SMark measures time only.

Nope, an SMarkTimer can measure whatever you want. (And it even has a class comment ;))

> * SMark has no support for "system" profilers and similar.

That's absent, true.

> * Finally, SMark spits out a report and that's it.

Well, reports and raw data. I use ReBench [1], and pipe the raw data directly into my LaTeX/knitr/R tool chain to generate the graphs/numbers in my papers (see for example section 4 of [2], which is generated from a LaTeX file with embedded R).

So, I'd say there are some interesting differences. But much of what you mention seems to come down to missing documentation/communication ;)

Best regards
Stefan

[1] https://github.com/smarr/ReBench
[2] http://stefan-marr.de/papers/oopsla-marr-ducasse-meta-tracing-vs-partial-evaluation/
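P.S. In case a concrete picture helps, a minimal SMark suite looks roughly like this (written from memory, so double-check the exact selectors against the code; the class and method names are just examples):

    SMarkSuite subclass: #FuelBenchmarks
        instanceVariableNames: ''
        classVariableNames: ''
        category: 'FuelBenchmarks'

    FuelBenchmarks >> benchSortLargeArray
        "Benchmark methods are, by convention, prefixed with 'bench' and
         picked up automatically; the body is what gets measured."
        (1 to: 100000) asOrderedCollection shuffled sort

    "Run from a workspace; the argument is the number of iterations."
    FuelBenchmarks run: 10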
Thanks Stefan for the follow-up.
Hi Stefan,
OK, you proved I was wrong - I did say I might be. Thanks for the clarification!

Jan