Smalltalk › Squeak › Squeak - Dev

true hash

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

43 messages Options

123

Paul DeBruicker

Re: DateAndTime hash was: Re: [squeak-dev] true hash

On 05/13/2012 08:04 PM, Levente Uzonyi wrote:

> On Sat, 12 May 2012, Andres Valloud wrote:
>
>> When asserting speedups and collision rates, please also mention what
>> was the dataset used for the measurements. In that way, others can
>> replicate the results and compare hash functions on a consistent basis.
>
> +1. I started hacking it a bit and got another ~2x speedup on my
> benchmark, but it was too simple and I didn't have the time to dig deeper.
>
>
> Levente

The benchmark is described first and then the datasets/collision rate
estimates are below that. With credit to Andres Valloud's work put
forth in his book:
www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M

I've just made a naive version modelled after the process he lays out in
the book. This version only tests dates and times and does not calculate
chi-sq or anything like that. I put my code here:
http://ss3.gemstone.com/ss/TimeHashing
and you also need:
http://www.squeaksource.com/DHBNumerical.html

When I claimed it was 100x faster I was relaying the results of running
this test in a workspace:

now:=DateAndTime now.
[now hash] bench

and changed between the old:

DateAndTime>>#hash
^self asUTC ticks hash

and the new:

DateAndTime>>#hash
| totalSeconds |
totalSeconds := seconds - offset asSeconds.
^ ((totalSeconds // 86400 + jdn) hashMultiply bitXor: totalSeconds \\
86400) bitXor: nanos

I ran it in Squeak4.3 and Pharo1.3 on ubuntu-64 bit with Eliot's cog vm.

Based on Levente's comment I made some more benchmarks that test the
hashing speed for the items in the datasets I used for testing
collisions and included them in the package. It seems like the 100x
holds up when testing a larger set of values but in either case I could
also be messing up the measuring of things I don't intend to. I don't
know.

---------------------------------------------------

For the datasets for the collision rates:

I build two arrays and also used 'DateAndTime allInstances asSet'.

The first array is a random array of DateAndTime instances of a user
specified size where the value for the jdn, second, offset, and nano are
selected independently at random from valid values. For the offsets I
use those listed on Wikipedia. There are 55.

The second array is a sequential array where I create a random
DateAndTime as above but then for a given number of iterations add 1
second, 1 minute, 1 day, 1 week, 30 days, and 1 year to the created date
respectively. So you wind up with an array that has seven sections of
roughly equal length with each element a second, minute, hour, day,
week, month, or year apart. Duplicates are removed before running the
tests.

In the DateAndTimeHashing class there are three class side methods.
(runRandomHashing, runAllInstancesHashing, runSequentialHashing) that
will run and spit out the number of collisions / number of iterations to
the Transcript.

The number of iterations in the version in ss3 is artificially low and
should probably be raised once the code is loaded. The data sets can be
refreshed to different random values by running:

DateAndTimeHashing refreshDataSets.

To modify the hash function that's tested change
DateAndTimeHashing>>#hashOf: I've left the old version and Nicolas
Cellier's version in the comments of that method.

I originally built this to test options for Chronos's Timepoint hash
which is just the 'day since the squeak epoch' in the version for Squeak.

Andres Valloud-4

Re: DateAndTime hash was: Re: [squeak-dev] true hash

In the Cincom public repository, there's a bundle called Hash Analysis
Tool. Also, there's another bundle called Hash Analysis Tool
Extensions. Those two implement a tool to do what you're doing, and
provide something like 300 hash functions and 100 datasets out of the
box. The license is MIT.

I tried to have a decent level of independence between the tool itself
and the GUI, so it shouldn't be *that* hard to port if needed.
Otherwise, it might be easier to simply get a personal use copy of
VisualWorks and load the tool.

You can see some of the thoughts that went into the Hash Analysis Tool here:

http://www.youtube.com/watch?v=jeLGRjQqRf0

Here are some hashing specific comments as well:

http://video.google.com/videoplay?docid=-7510297003494499688&hl=en

You can also get a tool manual here:

ftp://sqrmax.dyndns.org/pub/Papers/

Feedback on any of this is welcome :).

On 5/14/12 0:15 , Paul DeBruicker wrote:
> The benchmark is described first and then the datasets/collision rate
> estimates are below that. With credit to Andres Valloud's work put
> forth in his book:
> www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M
>
> I've just made a naive version modelled after the process he lays out in
> the book. This version only tests dates and times and does not calculate
> chi-sq or anything like that.

Andres Valloud-4

Re: DateAndTime hash was: Re: [squeak-dev] true hash

Oh, also, there is additional implementation specific material in the
Fundamentals book.

On 5/14/12 0:31 , Andres Valloud wrote:

> In the Cincom public repository, there's a bundle called Hash Analysis
> Tool. Also, there's another bundle called Hash Analysis Tool
> Extensions. Those two implement a tool to do what you're doing, and
> provide something like 300 hash functions and 100 datasets out of the
> box. The license is MIT.
>
> I tried to have a decent level of independence between the tool itself
> and the GUI, so it shouldn't be *that* hard to port if needed.
> Otherwise, it might be easier to simply get a personal use copy of
> VisualWorks and load the tool.
>
> You can see some of the thoughts that went into the Hash Analysis Tool here:
>
> http://www.youtube.com/watch?v=jeLGRjQqRf0
>
>
> Here are some hashing specific comments as well:
>
> http://video.google.com/videoplay?docid=-7510297003494499688&hl=en
>
>
> You can also get a tool manual here:
>
> ftp://sqrmax.dyndns.org/pub/Papers/
>
>
> Feedback on any of this is welcome :).
>
> On 5/14/12 0:15 , Paul DeBruicker wrote:
>> The benchmark is described first and then the datasets/collision rate
>> estimates are below that. With credit to Andres Valloud's work put
>> forth in his book:
>> www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M
>>
>> I've just made a naive version modelled after the process he lays out in
>> the book. This version only tests dates and times and does not calculate
>> chi-sq or anything like that.
>
>

123