Smalltalk › Squeak › Squeak - Dev

hashing question

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

5 messages Options

Chris Muller-3

hashing question

Hi all, assuming I have a special-kind of array that can grow without
enumerating the entire old array, is there any way to implement a
Dictionary that can grow without enumerating its internal array? I'll
elaborate.

In our own Dictionary, when we grow there are two reasons why we must
enumerate the entire old array:

1) because Array's in Squeak can't grow; for that we have
OrderedCollection. So we must allocate a new Array and copy the
existing elements to the new-sized array.
2) during this time, the elements are redistributed based on modulo
the new array's size.

I have a special kind of Array (let's call it SpecialArray) that can
grow without having to re-allocate (because it's on disk and files can
grow). I want to use this as the basis for a new kind of Dictionary,
but I want to avoid the need to _redistribute_ the elements.

One idea is to pre-allocate as large a SpecialArray as I think I'll
ever need (say, 30 million) and simply don't grow. At 6-bytes per
pointer, that's a 90MB file, but at least it won't ever get any
bigger. "Disk space is cheap," and, in most cases, this actually is
fine for my needs.

Still, it prevents one from having a lot of instances of SpecialArray
if they wanted, and one doesn't always know what will be "more than
enough".

Is there a more elegant solution that could make use of SpecialArray's
ability to grow, but not redistribute? It seems like not, since not
only we need to use the new "upper half" of the Array for new
elements, we still need to find elements in the "lower-half" (the
first half). The size of the internal array is used in the hash key
calculation, so that would be a problem for finding the old elements..

- Chris

PS - I thought I remember someone posting a book about hashing
algorithms, but I cannot find it..

Levente Uzonyi-2

Re: hashing question

On Wed, 3 Nov 2010, Chris Muller wrote:

> Hi all, assuming I have a special-kind of array that can grow without
> enumerating the entire old array, is there any way to implement a
> Dictionary that can grow without enumerating its internal array? I'll
> elaborate.

I guess you are looking for something like this:
http://en.wikipedia.org/wiki/Linear_hashing
http://en.wikipedia.org/wiki/Consistent_hashing

Levente

>
> In our own Dictionary, when we grow there are two reasons why we must
> enumerate the entire old array:
>
> 1) because Array's in Squeak can't grow; for that we have
> OrderedCollection. So we must allocate a new Array and copy the
> existing elements to the new-sized array.
> 2) during this time, the elements are redistributed based on modulo
> the new array's size.
>
> I have a special kind of Array (let's call it SpecialArray) that can
> grow without having to re-allocate (because it's on disk and files can
> grow). I want to use this as the basis for a new kind of Dictionary,
> but I want to avoid the need to _redistribute_ the elements.
>
> One idea is to pre-allocate as large a SpecialArray as I think I'll
> ever need (say, 30 million) and simply don't grow. At 6-bytes per
> pointer, that's a 90MB file, but at least it won't ever get any
> bigger. "Disk space is cheap," and, in most cases, this actually is
> fine for my needs.
>
> Still, it prevents one from having a lot of instances of SpecialArray
> if they wanted, and one doesn't always know what will be "more than
> enough".
>
> Is there a more elegant solution that could make use of SpecialArray's
> ability to grow, but not redistribute? It seems like not, since not
> only we need to use the new "upper half" of the Array for new
> elements, we still need to find elements in the "lower-half" (the
> first half). The size of the internal array is used in the hash key
> calculation, so that would be a problem for finding the old elements..
>
> - Chris
>
> PS - I thought I remember someone posting a book about hashing
> algorithms, but I cannot find it..
>
>

Andres Valloud-4

Re: hashing question

Depending on the application, you may also want to use hash buckets.

On 11/3/10 11:55 , Levente Uzonyi wrote:

> On Wed, 3 Nov 2010, Chris Muller wrote:
>
>> Hi all, assuming I have a special-kind of array that can grow without
>> enumerating the entire old array, is there any way to implement a
>> Dictionary that can grow without enumerating its internal array? I'll
>> elaborate.
>
> I guess you are looking for something like this:
> http://en.wikipedia.org/wiki/Linear_hashing
> http://en.wikipedia.org/wiki/Consistent_hashing
>
>
> Levente
>
>>
>> In our own Dictionary, when we grow there are two reasons why we must
>> enumerate the entire old array:
>>
>> 1) because Array's in Squeak can't grow; for that we have
>> OrderedCollection. So we must allocate a new Array and copy the
>> existing elements to the new-sized array.
>> 2) during this time, the elements are redistributed based on modulo
>> the new array's size.
>>
>> I have a special kind of Array (let's call it SpecialArray) that can
>> grow without having to re-allocate (because it's on disk and files can
>> grow). I want to use this as the basis for a new kind of Dictionary,
>> but I want to avoid the need to _redistribute_ the elements.
>>
>> One idea is to pre-allocate as large a SpecialArray as I think I'll
>> ever need (say, 30 million) and simply don't grow. At 6-bytes per
>> pointer, that's a 90MB file, but at least it won't ever get any
>> bigger. "Disk space is cheap," and, in most cases, this actually is
>> fine for my needs.
>>
>> Still, it prevents one from having a lot of instances of SpecialArray
>> if they wanted, and one doesn't always know what will be "more than
>> enough".
>>
>> Is there a more elegant solution that could make use of SpecialArray's
>> ability to grow, but not redistribute? It seems like not, since not
>> only we need to use the new "upper half" of the Array for new
>> elements, we still need to find elements in the "lower-half" (the
>> first half). The size of the internal array is used in the hash key
>> calculation, so that would be a problem for finding the old elements..
>>
>> - Chris
>>
>> PS - I thought I remember someone posting a book about hashing
>> algorithms, but I cannot find it..
>>
>>
>
> .
>

Tom Rushworth-3

Re: hashing question

In reply to this post by Chris Muller-3

Hi Chris,

Look at the ideas in a Split-Order-List hash table, which allows expansion without re-hashing. I wrote a sample implementation in Squeak a while back:

-------- begin quote (of my own outgoing mail, a few months ago)

I've created a project on squeaksouce for it, and saved one further round of refactoring/development, but then I ran out of time to take it much further. Look for the project:

http://www.squeaksource.com/SOLHashTable.html

The class comments contain a reference to the main paper I used to implement the algorithm. There's a reasonable amount of testing code somewhere in there too, but my not knowing the ins and outs of the Squeak test framework means the test code is organized as "just a bunch of code" :).
------------- end quote

You'll want to look at the main paper (see the class comment) and then follow the the references back as far as you need to in order to understand the SOL. Or just play with the squeak implementation :). The main paper is all about using SOL for lock-free hashing, but you don't need the lock free stuff, just the non-enumeration for expansion.

On 2010-Nov-3, at 11:16, Chris Muller wrote:

> Hi all, assuming I have a special-kind of array that can grow without
> enumerating the entire old array, is there any way to implement a
> Dictionary that can grow without enumerating its internal array? I'll
> elaborate.
>
> In our own Dictionary, when we grow there are two reasons why we must
> enumerate the entire old array:
>
> 1) because Array's in Squeak can't grow; for that we have
> OrderedCollection. So we must allocate a new Array and copy the
> existing elements to the new-sized array.
> 2) during this time, the elements are redistributed based on modulo
> the new array's size.
>
> I have a special kind of Array (let's call it SpecialArray) that can
> grow without having to re-allocate (because it's on disk and files can
> grow). I want to use this as the basis for a new kind of Dictionary,
> but I want to avoid the need to _redistribute_ the elements.
>
> One idea is to pre-allocate as large a SpecialArray as I think I'll
> ever need (say, 30 million) and simply don't grow. At 6-bytes per
> pointer, that's a 90MB file, but at least it won't ever get any
> bigger. "Disk space is cheap," and, in most cases, this actually is
> fine for my needs.
>
> Still, it prevents one from having a lot of instances of SpecialArray
> if they wanted, and one doesn't always know what will be "more than
> enough".
>
> Is there a more elegant solution that could make use of SpecialArray's
> ability to grow, but not redistribute? It seems like not, since not
> only we need to use the new "upper half" of the Array for new
> elements, we still need to find elements in the "lower-half" (the
> first half). The size of the internal array is used in the hash key
> calculation, so that would be a problem for finding the old elements..
>
> - Chris
>
> PS - I thought I remember someone posting a book about hashing
> algorithms, but I cannot find it..
>

--
Tom Rushworth

Chris Muller-4

Re: hashing question

Oh, great responses, thank you all.