Smalltalk › Gemtalk › GLASS

[Glass] How to manage large amounts of data ...

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

5 messages Options

marten

[Glass] How to manage large amounts of data ...

Assuming I have a large number of objects with a unique key attribute
(uuid based).

Normally I use an instance of class Dictionary (or perhaps better:
StringKeyValueDictionary) to store instances of these objects. The
initial finding access to these objects are mostly done via its unique
key attribute.

So the access to an instance was pretty fast:

^aDictionary at: aKey ifAbsent: [ nil ]

But what happens, if this dictionary is getting very large (say: a
million of entries) ?

Is is better to have a different approach then: e.g. a collection with a
defined index on that unique attribute ?

Thanks for answering ...

Marten

--
Marten Feldtmann
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

James Foster-9

Re: [Glass] How to manage large amounts of data ...

Hi Marten,

While it would certainly be interesting to do some performance measurements, a Dictionary should be very efficient even when large. The general approach is to take a hash of the key and use it as an offset into a linear collection of small “buckets” that are then searched using a different algorithm. As long as your keys have a good hash function then things ought to be quite fast—probably faster than using an index.

James

On Aug 14, 2014, at 7:01 AM, [hidden email] wrote:

> Assuming I have a large number of objects with a unique key attribute
> (uuid based).
>
> Normally I use an instance of class Dictionary (or perhaps better:
> StringKeyValueDictionary) to store instances of these objects. The
> initial finding access to these objects are mostly done via its unique
> key attribute.
>
> So the access to an instance was pretty fast:
>
> ^aDictionary at: aKey ifAbsent: [ nil ]
>
> But what happens, if this dictionary is getting very large (say: a
> million of entries) ?
>
> Is is better to have a different approach then: e.g. a collection with a
> defined index on that unique attribute ?
>
>
> Thanks for answering ...
>
> Marten
>
>
> --
> Marten Feldtmann
> _______________________________________________
> Glass mailing list
> [hidden email]
> http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] How to manage large amounts of data ...

In reply to this post by marten

Marten,

I would say that Jame's suggestion is correct ... a dictionary is used for identity-based indexes .

You probably should use an IdentityKeyValueDictionary for quick access, but you will want to keep an eye on the size of the collision buckets in the dictionary ... when you hit the collision bucket limit on an at:put: the dictionary is immediately rebuilt, which can cause an unreasonable delay ... with a large dictionary you would probably want to rebuild the dictionary during off hours ...

IdentitySet would be an even better bet (no collision buckets combined with quick identity based lookups) but to be able to use the UUID for lookup you'd need to store the uuid in the identitySet ... but if you could arrange for the uuid to reference the object directly an identity set would work ...

Dale

On Thu, Aug 14, 2014 at 7:01 AM, [hidden email] <[hidden email]> wrote:

Assuming I have a large number of objects with a unique key attribute
(uuid based).

Normally I use an instance of class Dictionary (or perhaps better:
StringKeyValueDictionary) to store instances of these objects. The
initial finding access to these objects are mostly done via its unique
key attribute.

So the access to an instance was pretty fast:

^aDictionary at: aKey ifAbsent: [ nil ]

But what happens, if this dictionary is getting very large (say: a
million of entries) ?

Is is better to have a different approach then: e.g. a collection with a
defined index on that unique attribute ?

Thanks for answering ...

Marten

--
Marten Feldtmann
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] How to manage large amounts of data ...

Marten,

I inadvertently sent you private email, correcting my previous post... I shouldn't hit send on an empty stomach while getting ready for a plane trip in a couple of hours:), but even that mail was not quite correct

Obviously I missed the bit about having string keys for the UUIDs and I don't know what I was thinking about with the IdentitySet....

The fastest String indexes (using "basic" classes) use the first 12 characters of the string ... the first 12 characters are encoded as an integer, so objects are not faulted in while scanning the btrees during the #= test. If the Strings are longer than 12 characters the first 12 is used as a locator and then message sends (and object faulting) is used (beyond 12 characters performance falls off especially if the Strings are common in the first 12 characters)... The effective "collision bucket" size for a basic index is 500 so a String index with keys less than or equal to 12 characters in length can outperform a StringKeyValueDictionary with equivalent-sized collision buckets on the basis of avoiding object faults ... No rebuilding is required with btrees (they are balanced on the fly) while the dictionary will require full rebuilds to manage the size of collision buckets. Beyond 12 character keys, the StringKeyValueDictionary takes the lead and wins running away:)

If you can convert your key to a unique SmallInteger, you would compare the performance of a SmallInteger based index and an IdentityKeyValue dictionary ... you would not have a range limit like the String-based index and you would not have the faulting penalty in the IdentityKeyValue dictionary. so the difference will come down to collision bucket management and for the IdentityKeyValue case if you can predict your maximum SmallInteger range of values, you should be able to decide how big you want your collision buckets to be and I would think that the IdentityKeyValueDictionary would have the advantage.

Dale

On Thu, Aug 14, 2014 at 8:55 AM, Dale Henrichs <[hidden email]> wrote:

Marten,

I would say that Jame's suggestion is correct ... a dictionary is used for identity-based indexes .

You probably should use an IdentityKeyValueDictionary for quick access, but you will want to keep an eye on the size of the collision buckets in the dictionary ... when you hit the collision bucket limit on an at:put: the dictionary is immediately rebuilt, which can cause an unreasonable delay ... with a large dictionary you would probably want to rebuild the dictionary during off hours ...

IdentitySet would be an even better bet (no collision buckets combined with quick identity based lookups) but to be able to use the UUID for lookup you'd need to store the uuid in the identitySet ... but if you could arrange for the uuid to reference the object directly an identity set would work ...

Dale

On Thu, Aug 14, 2014 at 7:01 AM, [hidden email] <[hidden email]> wrote:

Assuming I have a large number of objects with a unique key attribute
(uuid based).

Normally I use an instance of class Dictionary (or perhaps better:
StringKeyValueDictionary) to store instances of these objects. The
initial finding access to these objects are mostly done via its unique
key attribute.

So the access to an instance was pretty fast:

^aDictionary at: aKey ifAbsent: [ nil ]

But what happens, if this dictionary is getting very large (say: a
million of entries) ?

Is is better to have a different approach then: e.g. a collection with a
defined index on that unique attribute ?

Thanks for answering ...

Marten

--
Marten Feldtmann
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

johnmci

Re: [Glass] How to manage large amounts of data ...

If you are using uuid as unique keys you should think a bit on this due to storage and indexing costs/issues. Background information here:

http://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29

Although microsoft used version 1 they and others were pillaged by the press as it exposed the MAC address, so generally people migrated to the random or hash based versions. Generally here you have at least 122 bits to work with so you play the math and say based on a probability of 2^122 there likely won't be a collision within any storage space I can think of.

However in retrospect we converted an app that stored about 10k worth of elements from uuid key based to use a unsigned 32bit integer, via arc4random(). Which gives a 2.3e-6 probability of collision, which we check for at generation time anyway. This led to much better performance and as a IOS app much improved storage costs as we just handle a 32bit integer, not a very long string.

For your case of millions of records if you use a 64bit integer as the key that might be worth while exploring.

I"m sure someone (on a trip somewhere?) could explore the costs of inserting 10M uuids, versus 10M arc4random (or other generator) with or without checking, into some type of dictionary, then retrieve a random million entities...

On Thu, Aug 14, 2014 at 10:12 AM, Dale Henrichs <[hidden email]> wrote:

Marten,

I inadvertently sent you private email, correcting my previous post... I shouldn't hit send on an empty stomach while getting ready for a plane trip in a couple of hours:), but even that mail was not quite correct

Obviously I missed the bit about having string keys for the UUIDs and I don't know what I was thinking about with the IdentitySet....

The fastest String indexes (using "basic" classes) use the first 12 characters of the string ... the first 12 characters are encoded as an integer, so objects are not faulted in while scanning the btrees during the #= test. If the Strings are longer than 12 characters the first 12 is used as a locator and then message sends (and object faulting) is used (beyond 12 characters performance falls off especially if the Strings are common in the first 12 characters)... The effective "collision bucket" size for a basic index is 500 so a String index with keys less than or equal to 12 characters in length can outperform a StringKeyValueDictionary with equivalent-sized collision buckets on the basis of avoiding object faults ... No rebuilding is required with btrees (they are balanced on the fly) while the dictionary will require full rebuilds to manage the size of collision buckets. Beyond 12 character keys, the StringKeyValueDictionary takes the lead and wins running away:)

If you can convert your key to a unique SmallInteger, you would compare the performance of a SmallInteger based index and an IdentityKeyValue dictionary ... you would not have a range limit like the String-based index and you would not have the faulting penalty in the IdentityKeyValue dictionary. so the difference will come down to collision bucket management and for the IdentityKeyValue case if you can predict your maximum SmallInteger range of values, you should be able to decide how big you want your collision buckets to be and I would think that the IdentityKeyValueDictionary would have the advantage.

Dale

On Thu, Aug 14, 2014 at 8:55 AM, Dale Henrichs <[hidden email]> wrote:

Marten,

I would say that Jame's suggestion is correct ... a dictionary is used for identity-based indexes .

You probably should use an IdentityKeyValueDictionary for quick access, but you will want to keep an eye on the size of the collision buckets in the dictionary ... when you hit the collision bucket limit on an at:put: the dictionary is immediately rebuilt, which can cause an unreasonable delay ... with a large dictionary you would probably want to rebuild the dictionary during off hours ...

IdentitySet would be an even better bet (no collision buckets combined with quick identity based lookups) but to be able to use the UUID for lookup you'd need to store the uuid in the identitySet ... but if you could arrange for the uuid to reference the object directly an identity set would work ...

Dale

On Thu, Aug 14, 2014 at 7:01 AM, [hidden email] <[hidden email]> wrote:

Assuming I have a large number of objects with a unique key attribute
(uuid based).

Normally I use an instance of class Dictionary (or perhaps better:
StringKeyValueDictionary) to store instances of these objects. The
initial finding access to these objects are mostly done via its unique
key attribute.

So the access to an instance was pretty fast:

^aDictionary at: aKey ifAbsent: [ nil ]

But what happens, if this dictionary is getting very large (say: a
million of entries) ?

Is is better to have a different approach then: e.g. a collection with a
defined index on that unique attribute ?

Thanks for answering ...

Marten

--
Marten Feldtmann
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

===========================================================================
John M. McIntosh <[hidden email]> https://www.linkedin.com/in/smalltalk
===========================================================================

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass