On 05/13/2012 08:04 PM, Levente Uzonyi wrote:
> On Sat, 12 May 2012, Andres Valloud wrote:
>> When asserting speedups and collision rates, please also mention what
>> was the dataset used for the measurements. In that way, others can
>> replicate the results and compare hash functions on a consistent basis.
>
> +1. I started hacking it a bit and got another ~2x speedup on my
> benchmark, but it was too simple and I didn't have the time to dig deeper.
>
> Levente

The benchmark is described first, and the datasets/collision rate
estimates are below that. Credit goes to Andres Valloud's work put forth
in his book:

www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M

I've just made a naive version modelled after the process he lays out in
the book. This version only tests dates and times and does not calculate
chi-squared or anything like that.

I put my code here:

http://ss3.gemstone.com/ss/TimeHashing

and you also need:

http://www.squeaksource.com/DHBNumerical.html

When I claimed it was 100x faster I was relaying the results of running
this test in a workspace:

	now := DateAndTime now.
	[now hash] bench

and changing between the old:

	DateAndTime>>#hash
		^self asUTC ticks hash

and the new:

	DateAndTime>>#hash
		| totalSeconds |
		totalSeconds := seconds - offset asSeconds.
		^ ((totalSeconds // 86400 + jdn) hashMultiply
			bitXor: totalSeconds \\ 86400) bitXor: nanos

I ran it in Squeak 4.3 and Pharo 1.3 on 64-bit Ubuntu with Eliot's Cog VM.

Based on Levente's comment I made some more benchmarks that test the
hashing speed for the items in the datasets I used for testing
collisions, and included them in the package. The 100x figure seems to
hold up when testing a larger set of values, but in either case I could
also be mismeasuring something I don't intend to. I don't know.

---------------------------------------------------

For the datasets for the collision rates, I built two arrays and also
used 'DateAndTime allInstances asSet'.
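For anyone not running a Smalltalk image, here's a rough Python model of the
arithmetic in the new #hash above (an illustration only, not the Smalltalk
code itself; I'm assuming Squeak's #hashMultiply multiplies by 1664525 and
keeps the low 28 bits):

```python
SECONDS_PER_DAY = 86400

def hash_multiply(x):
    # Models Squeak's SmallInteger>>#hashMultiply (assumption: multiply
    # by 1664525 and keep 28 bits so the result stays a SmallInteger).
    return (x * 1664525) & 0xFFFFFFF

def date_and_time_hash(jdn, seconds, offset_seconds, nanos):
    # seconds is the local seconds-of-day; subtracting the UTC offset
    # normalizes to UTC without building the asUTC copy the old hash made.
    total_seconds = seconds - offset_seconds
    days = total_seconds // SECONDS_PER_DAY + jdn
    # Mix the day number, then xor in seconds-of-day and nanoseconds.
    return hash_multiply(days) ^ (total_seconds % SECONDS_PER_DAY) ^ nanos
```

Note that Python's // and % floor toward negative infinity, matching
Smalltalk's // and \\, so negative offsets behave the same way.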
The first array is a random array of DateAndTime instances of a
user-specified size, where the values for the jdn, second, offset, and
nano are selected independently at random from valid values. For the
offsets I use those listed on Wikipedia; there are 55.

The second array is a sequential array where I create a random
DateAndTime as above, but then for a given number of iterations add
1 second, 1 minute, 1 hour, 1 day, 1 week, 30 days, and 1 year to the
created date respectively. So you wind up with an array that has seven
sections of roughly equal length, with each element a second, minute,
hour, day, week, month, or year apart. Duplicates are removed before
running the tests.

In the DateAndTimeHashing class there are three class-side methods
(runRandomHashing, runAllInstancesHashing, runSequentialHashing) that
will run and print the number of collisions / number of iterations to
the Transcript. The number of iterations in the version in ss3 is
artificially low and should probably be raised once the code is loaded.

The data sets can be refreshed to different random values by running:

	DateAndTimeHashing refreshDataSets.

To modify the hash function that's tested, change
DateAndTimeHashing>>#hashOf:. I've left the old version and Nicolas
Cellier's version in the comments of that method.

I originally built this to test options for Chronos's Timepoint hash,
which in the Squeak version is just the day since the Squeak epoch.
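The collision counting those class-side methods do can be sketched in a few
lines of Python (a hypothetical model of the process, not the package's
code; the toy dataset below stands in for the random DateAndTime array,
with offsets omitted for brevity):

```python
import random
from collections import Counter

def collision_rate(values, hash_fn):
    # Bucket the dataset by hash value; every element beyond the first
    # in a bucket is a collision. Return collisions / dataset size.
    counts = Counter(hash_fn(v) for v in values)
    collisions = sum(n - 1 for n in counts.values() if n > 1)
    return collisions / len(values)

# Toy stand-in dataset: (jdn, second-of-day, nanosecond) triples drawn
# independently at random from valid ranges, duplicates removed by the set.
random.seed(1)
dataset = {(random.randrange(0, 3_000_000),
            random.randrange(0, 86_400),
            random.randrange(0, 1_000_000_000))
           for _ in range(10_000)}

# Using Python's built-in tuple hash here; swap in any candidate function.
rate = collision_rate(dataset, hash)
```

The same harness works for the sequential dataset: generate one random
triple and repeatedly add the fixed increments before deduplicating.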
In the Cincom public repository, there's a bundle called Hash Analysis
Tool. Also, there's another bundle called Hash Analysis Tool Extensions.
Those two implement a tool to do what you're doing, and provide
something like 300 hash functions and 100 datasets out of the box. The
license is MIT.

I tried to have a decent level of independence between the tool itself
and the GUI, so it shouldn't be *that* hard to port if needed.
Otherwise, it might be easier to simply get a personal-use copy of
VisualWorks and load the tool.

You can see some of the thoughts that went into the Hash Analysis Tool here:

http://www.youtube.com/watch?v=jeLGRjQqRf0

Here are some hashing-specific comments as well:

http://video.google.com/videoplay?docid=-7510297003494499688&hl=en

You can also get a tool manual here:

ftp://sqrmax.dyndns.org/pub/Papers/

Feedback on any of this is welcome :).

On 5/14/12 0:15 , Paul DeBruicker wrote:
> The benchmark is described first and then the datasets/collision rate
> estimates are below that. With credit to Andres Valloud's work put
> forth in his book:
> www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M
>
> I've just made a naive version modelled after the process he lays out in
> the book. This version only tests dates and times and does not calculate
> chi-sq or anything like that.
Oh, also, there is additional implementation-specific material in the
Fundamentals book.

On 5/14/12 0:31 , Andres Valloud wrote:
> In the Cincom public repository, there's a bundle called Hash Analysis
> Tool. Also, there's another bundle called Hash Analysis Tool
> Extensions. Those two implement a tool to do what you're doing, and
> provide something like 300 hash functions and 100 datasets out of the
> box. The license is MIT.
>
> I tried to have a decent level of independence between the tool itself
> and the GUI, so it shouldn't be *that* hard to port if needed.
> Otherwise, it might be easier to simply get a personal-use copy of
> VisualWorks and load the tool.
>
> You can see some of the thoughts that went into the Hash Analysis Tool here:
>
> http://www.youtube.com/watch?v=jeLGRjQqRf0
>
> Here are some hashing-specific comments as well:
>
> http://video.google.com/videoplay?docid=-7510297003494499688&hl=en
>
> You can also get a tool manual here:
>
> ftp://sqrmax.dyndns.org/pub/Papers/
>
> Feedback on any of this is welcome :).
>
> On 5/14/12 0:15 , Paul DeBruicker wrote:
>> The benchmark is described first and then the datasets/collision rate
>> estimates are below that. With credit to Andres Valloud's work put
>> forth in his book:
>> www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M
>>
>> I've just made a naive version modelled after the process he lays out in
>> the book. This version only tests dates and times and does not calculate
>> chi-sq or anything like that.