Large collection and common practice

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Large collection and common practice

BrunoBB
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in my case OrbeonFormInstance). The dictionary is very useful because the key is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class (like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2 collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but indexes are also needed which solution should be implemented ?

regards
bruno
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list

On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote:
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and
the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in
my case OrbeonFormInstance). The dictionary is very useful because the key
is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class
(like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the
RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2
collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but
indexes are also needed which solution should be implemented ?


Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. 

But...I am sure someone more experienced will come with a better approach!

Cheers, 



regards
bruno



--
View this message in context: http://forum.world.st/Large-collection-and-common-practice-tp4928607.html
Sent from the GLASS mailing list archive at Nabble.com.
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
Hi Bruno,

Multiple sessions can feed an RC collection with reduced commit conflicts, but you don't always want to keep the objects in the RC collection. One common technique is to have a manager session dedicated to moving objects from RC collections into collections that can be accessed more efficiently. Design so that the manager is the only session that will be updating the collections (so that commit conflicts will not happen). The manager session can do polling for new items and you can add gem-to-gem signaling to wake the manager for more timely responses. The challenges with this kind of design are related to update timing between sessions. The process involves a commit to add to the RC collection, an abort for the manager session to see the objects, a commit by the manager to update the root collection (with RC collection removal, RcQueues are usually used BTW), and an abort by the original session if it needs to see the indexed item was added to the root collection. If there is timing sensitivity with this data then you'll likely resort to searching first in your indexeded collection and then also reviewing objects still in the queue waiting for the manager to process them.

A variation of the manager session technique is to send data to the manager session without doing a commit, this might be through communication between gems or by using session-specific file updates that the manager gem reads. Gem-to-gem signaling can be added to this approach later too if you need to improve timing. This variation can avoid the intermediate commit, but you'd still may need to #continueTransaction to see what the manager session updated. 

I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key. A dictionary can even use a key that is a custom object that redefines equality and hash from attributes of what is searched for. Merkle tree hashes might also be used as a way to test if some attribute is contained, but that is a bit advanced to go into. Another advanced item that I once implemented was a custom Dictionary where the key was derived from the value by behavior (it was more efficient because it avoided the cost of Association creation). So many cool tricks, I loved working with GS/S.

Paul Baumann



On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <[hidden email]> wrote:

On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote:
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and
the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in
my case OrbeonFormInstance). The dictionary is very useful because the key
is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class
(like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the
RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2
collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but
indexes are also needed which solution should be implemented ?


Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. 

But...I am sure someone more experienced will come with a better approach!

Cheers, 



regards
bruno



--
View this message in context: http://forum.world.st/Large-collection-and-common-practice-tp4928607.html
Sent from the GLASS mailing list archive at Nabble.com.
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by BrunoBB


On 01/03/2017 10:14 AM, BrunoBB via Glass wrote:

> Hi All,
>
> I have a lot RcKeyValueDictionary where the key is the id of the object and
> the value is the object itself.
> This id once assigned it does NOT change, so far so good :)
>
> The RcKeyValueDictionary is used intensively to add and remove objects (in
> my case OrbeonFormInstance). The dictionary is very useful because the key
> is always given as parameter.
>
> Also there are searchs by specific inst var of OrbeonFormInstance class
> (like username,group, createdTime and so on).
>
> My problem is that i can NOT create an index on aRcKeyValueDictionary.
> So which is the commom practice in these cases:
> 1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
> 2- Add a new instance variable to the class that holds the
> RcKeyValueDictionary and this new variable to be anUnorderedCollection ?
>
> 1) This will complicate my direct searchs using the ID.
> 2) Extra computation when adding and removing objects (now there 2
> collections to maintain)
>
> The general question will be something like:
> When Dictionaries are very suitable to store large quantity of objects but
> indexes are also needed which solution should be implemented ?
>
Bruno,

The general answer to your general question is that if you start out
using a dictionary for lookups of a single field in an object and then
get to the point where you are interested doing queries against multiple
fields in your object _REPLACING_ your dictionary with an indexed
collection starts to make sense.

You can create identity indexes on the fields that are identity-based
like your id field or group (assuming groups are identified by a Symbol
or specific instances of a set of group objects) and use equality
indexes for fields where you cannot use identity (username) where you
are interested in doing queries that involve ranges of values
(createdTime).

If the indexed collection is subject to concurrent additions/deletions,
then you should use an RcIdentityBag. If the objects themselves are
subject to concurrent updates to indexed fields, then you can create
indexes using the `reducedConflict` option.

To do identity-based lookups you would not quite have the convenience of
using `dictionary at: id` as you would need to create a GsQuery that in
it's simplest form would look like (assuming indexedCollection and id
set appropriately):

   ('each.id == id' asQueryOn: indexedCollection)
     bind: 'id' to: id
     queryResult.

The inconvenience of the query would be offset by the fact that you
would still have only one collection to maintain (the RcIdentityBag) and
using indexes means that if any or your fields are changed, the indexes
are automatically updated ...

You can cache a GsQuery instance to avoid the overhead of parsing the
query on every invocation and you can use a Smalltalk API for creating a
GsQuery to avoid the complication of creating a string representation of
queries.

If you are looking for maximum query and update performance, you might
find that custom collections or object structures might well perform
better than using indexes, but the speed advantages have to be offset by
the complexity of maintaining these custom collections. Custom
collections or object structures come into play if you know ahead of
time the types of queries that you want to run ...

As the size of the collections gets very large, you have to keep in mind
that Dictionary-based structures have to be rebuilt periodically to keep
the collision bucket size manageable and some of the dictionaries like
RcKeyValueDictionary will rebuild automatically on insertion and
depending upon the size of the dictionary that could lead to long and
unpredictable delays for end users ... the btree structures used in
GemStone indexes do splits and merges on individual leaf nodes limiting
the cost of insertions ...

I am afraid that there is no simple answer ...

Dale
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by GLASS mailing list

Paul,

Thanks for your answer ...

/*  you don't always want to keep the objects in the RC collection

Why you don't always want to keep the objects in the RC collection ?
This is what i'm doing right now :(  - RcKeyValueDictionary

Thanks for the technique you are explaining.

For now i will keep as simple as i can :) may be in the future (next year) i can do something like that but i need to do much more research :)

/* I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key.

This project/system implement a persistence layer (using rest services) for a Java Application (www.orbeon.com) which is used to design, publish, save and query web forms.
(https://github.com/brunobuzzi/OrbeonPersistenceLayer/)

When designing/publishing/sending/saving form --> the ID is mostly used.
Then you have the Summary page. That display all form instances (saved and sent forms) of some form definition.
Here the user can search by a particular field of the defined form.
A search can be by N different fields depending on the form definition (the definition could be a form with 200 nested fields and sections or whatever).

In this case indexes are very useful but in the previous cases a Dictionary is more suitable using the id (that after assigned is immutable)

regards,
bruno

El 03/01/2017 a las 17:38, Paul Baumann escribió:
Hi Bruno,

Multiple sessions can feed an RC collection with reduced commit conflicts, but you don't always want to keep the objects in the RC collection. One common technique is to have a manager session dedicated to moving objects from RC collections into collections that can be accessed more efficiently. Design so that the manager is the only session that will be updating the collections (so that commit conflicts will not happen). The manager session can do polling for new items and you can add gem-to-gem signaling to wake the manager for more timely responses. The challenges with this kind of design are related to update timing between sessions. The process involves a commit to add to the RC collection, an abort for the manager session to see the objects, a commit by the manager to update the root collection (with RC collection removal, RcQueues are usually used BTW), and an abort by the original session if it needs to see the indexed item was added to the root collection. If there is timing sensitivity with this data then you'll likely resort to searching first in your indexeded collection and then also reviewing objects still in the queue waiting for the manager to process them.

A variation of the manager session technique is to send data to the manager session without doing a commit, this might be through communication between gems or by using session-specific file updates that the manager gem reads. Gem-to-gem signaling can be added to this approach later too if you need to improve timing. This variation can avoid the intermediate commit, but you'd still may need to #continueTransaction to see what the manager session updated. 

I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key. A dictionary can even use a key that is a custom object that redefines equality and hash from attributes of what is searched for. Merkle tree hashes might also be used as a way to test if some attribute is contained, but that is a bit advanced to go into. Another advanced item that I once implemented was a custom Dictionary where the key was derived from the value by behavior (it was more efficient because it avoided the cost of Association creation). So many cool tricks, I loved working with GS/S.

Paul Baumann



On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <[hidden email]> wrote:

On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote:
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and
the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in
my case OrbeonFormInstance). The dictionary is very useful because the key
is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class
(like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the
RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2
collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but
indexes are also needed which solution should be implemented ?


Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. 

But...I am sure someone more experienced will come with a better approach!

Cheers, 



regards
bruno



--
View this message in context: http://forum.world.st/Large-collection-and-common-practice-tp4928607.html
Sent from the GLASS mailing list archive at Nabble.com.
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass






Avast logo

El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
www.avast.com



_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by GLASS mailing list
Dale,

Thanks for the detailed answer :)

I think replacing the Dictionary with Rc collection will be the less
complex option.

Although adding an additional inst var (an rc collection) it won't be
that complex.

Since to reach 200.000 forms or more it takes time then i will implement
more complex in the near future :)

I will analyze the problem a little more  before selecting another approach.

Thanks to all for your answers !

regards,

bruno

El 03/01/2017 a las 17:46, Dale Henrichs escribió:

>
>
> On 01/03/2017 10:14 AM, BrunoBB via Glass wrote:
>> Hi All,
>>
>> I have a lot RcKeyValueDictionary where the key is the id of the
>> object and
>> the value is the object itself.
>> This id once assigned it does NOT change, so far so good :)
>>
>> The RcKeyValueDictionary is used intensively to add and remove
>> objects (in
>> my case OrbeonFormInstance). The dictionary is very useful because
>> the key
>> is always given as parameter.
>>
>> Also there are searchs by specific inst var of OrbeonFormInstance class
>> (like username,group, createdTime and so on).
>>
>> My problem is that i can NOT create an index on aRcKeyValueDictionary.
>> So which is the commom practice in these cases:
>> 1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
>> 2- Add a new instance variable to the class that holds the
>> RcKeyValueDictionary and this new variable to be anUnorderedCollection ?
>>
>> 1) This will complicate my direct searchs using the ID.
>> 2) Extra computation when adding and removing objects (now there 2
>> collections to maintain)
>>
>> The general question will be something like:
>> When Dictionaries are very suitable to store large quantity of
>> objects but
>> indexes are also needed which solution should be implemented ?
>>
> Bruno,
>
> The general answer to your general question is that if you start out
> using a dictionary for lookups of a single field in an object and then
> get to the point where you are interested doing queries against
> multiple fields in your object _REPLACING_ your dictionary with an
> indexed collection starts to make sense.
>
> You can create identity indexes on the fields that are identity-based
> like your id field or group (assuming groups are identified by a
> Symbol or specific instances of a set of group objects) and use
> equality indexes for fields where you cannot use identity (username)
> where you are interested in doing queries that involve ranges of
> values (createdTime).
>
> If the indexed collection is subject to concurrent
> additions/deletions, then you should use an RcIdentityBag. If the
> objects themselves are subject to concurrent updates to indexed
> fields, then you can create indexes using the `reducedConflict` option.
>
> To do identity-based lookups you would not quite have the convenience
> of using `dictionary at: id` as you would need to create a GsQuery
> that in it's simplest form would look like (assuming indexedCollection
> and id set appropriately):
>
>   ('each.id == id' asQueryOn: indexedCollection)
>     bind: 'id' to: id
>     queryResult.
>
> The inconvenience of the query would be offset by the fact that you
> would still have only one collection to maintain (the RcIdentityBag)
> and using indexes means that if any or your fields are changed, the
> indexes are automatically updated ...
>
> You can cache a GsQuery instance to avoid the overhead of parsing the
> query on every invocation and you can use a Smalltalk API for creating
> a GsQuery to avoid the complication of creating a string
> representation of queries.
>
> If you are looking for maximum query and update performance, you might
> find that custom collections or object structures might well perform
> better than using indexes, but the speed advantages have to be offset
> by the complexity of maintaining these custom collections. Custom
> collections or object structures come into play if you know ahead of
> time the types of queries that you want to run ...
>
> As the size of the collections gets very large, you have to keep in
> mind that Dictionary-based structures have to be rebuilt periodically
> to keep the collision bucket size manageable and some of the
> dictionaries like RcKeyValueDictionary will rebuild automatically on
> insertion and depending upon the size of the dictionary that could
> lead to long and unpredictable delays for end users ... the btree
> structures used in GemStone indexes do splits and merges on individual
> leaf nodes limiting the cost of insertions ...
>
> I am afraid that there is no simple answer ...
>
> Dale
>


---
El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
https://www.avast.com/antivirus

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by GLASS mailing list
Hi Bruno,

Why you might not want to keep objects forever in an RC collection:

RC collections have more overhead. Many are implemented with session-specific sub-structures that can be modified by the session with little risk of conflict (the rare conflict is when the RC collection itself changes). Consider if you have an RC collection that was populated from 100 separate sessions and then the query that the RC collection would need to do to see if an object/key exists in the RC collection. RC collections are well implemented and reasonably efficient, they just aren't as efficient at some operations (like lookup) as some non-RC collections (otherwise all would be implemented to have RC behavior). You'll find that RC collections have sometimes unexpected growth and shrinkage behavior. Some grow large session-specific subcollections that may never be cleaned up unless there is at least one removal. Some grow in inopportune moments that can affect time-sensitive operations. I'm not saying that you shouldn't use the RC collections for root collections, it depends on your application needs.

Regarding indexing for many attributes:

Sounds like you want to create indexes on a common collection like OrderedCollection that only one session in in charge of updating. I know that GemTalk had improved their indexing implementation several years back, but some kinds of practical issues likely still remain. A field index updates some underlying structure that might also be updated from changes to other objects by other sessions, updates to indexes used to cause many commit conflicts. The more indexes a collection has, the higher the odds of commit conflict. Applications that I've worked on for the past decade or so didn't use collection indexes like you are about to do. An application that used a lot of indexes also had some custom code to save and replay changes to domain objects to compensate for unpredictable commit failures. It is from experiences like that that the queue-manager approach became useful despite all the cross-session coordination.

I'd probably implement a query kind of object that wraps that collection to support collection-specific queries and maintenance operations. The OC (or whatever you use) would normally be private to the query object. The query object could even have special behavior for avoiding commit conflicts (like locking or queueing for example). The query object might for even be clever enough to do a private/internal RC queue when your application code detects conflict is possible (like from use of locks). The queue object would manage the internal RC collection as practical.

You might think of making that query object a subclass of Collection but any GBS users out there should beware that there would be replication bugs (I'd reported the bug with workaround code to GemTalk many years ago). I doubt you'd be doing replication of something like this even if you used GBS, but just saying there was is a bit of strangeness to be discovered at the basic/private/primitive levels and unfortunately it means that caution applies to user-defined subclasses of Collection.

I'm not suggesting you do this, but it is an option. In the time that indexing was not reliable I'd once resorted to creating my own application-specific indexes. That query object that I just mentioned could also have private dictionary instances that can quickly resolve specific keys (attributes of the objects). The query object has the overhead of also maintaining the private attribute-key dictionaries as object are added and removed. I could go into how I implemented these application-defined indexes even without the query object wrapping it, but no need because you have good GemTalk supported indexes now anyway.

I've presented ideas more complicated than you'll need, hopefully an awareness of potential issues and past remedies will save you some effort.

Regards,

Paul Baumann


On Tue, Jan 3, 2017 at 5:04 PM, Smalltalk <[hidden email]> wrote:

Paul,

Thanks for your answer ...

/*  you don't always want to keep the objects in the RC collection

Why you don't always want to keep the objects in the RC collection ?
This is what i'm doing right now :(  - RcKeyValueDictionary

Thanks for the technique you are explaining.

For now i will keep as simple as i can :) may be in the future (next year) i can do something like that but i need to do much more research :)

/* I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key.

This project/system implement a persistence layer (using rest services) for a Java Application (www.orbeon.com) which is used to design, publish, save and query web forms.
(https://github.com/brunobuzzi/OrbeonPersistenceLayer/)

When designing/publishing/sending/saving form --> the ID is mostly used.
Then you have the Summary page. That display all form instances (saved and sent forms) of some form definition.
Here the user can search by a particular field of the defined form.
A search can be by N different fields depending on the form definition (the definition could be a form with 200 nested fields and sections or whatever).

In this case indexes are very useful but in the previous cases a Dictionary is more suitable using the id (that after assigned is immutable)

regards,
bruno


El 03/01/2017 a las 17:38, Paul Baumann escribió:
Hi Bruno,

Multiple sessions can feed an RC collection with reduced commit conflicts, but you don't always want to keep the objects in the RC collection. One common technique is to have a manager session dedicated to moving objects from RC collections into collections that can be accessed more efficiently. Design so that the manager is the only session that will be updating the collections (so that commit conflicts will not happen). The manager session can do polling for new items and you can add gem-to-gem signaling to wake the manager for more timely responses. The challenges with this kind of design are related to update timing between sessions. The process involves a commit to add to the RC collection, an abort for the manager session to see the objects, a commit by the manager to update the root collection (with RC collection removal, RcQueues are usually used BTW), and an abort by the original session if it needs to see the indexed item was added to the root collection. If there is timing sensitivity with this data then you'll likely resort to searching first in your indexeded collection and then also reviewing objects still in the queue waiting for the manager to process them.

A variation of the manager session technique is to send data to the manager session without doing a commit, this might be through communication between gems or by using session-specific file updates that the manager gem reads. Gem-to-gem signaling can be added to this approach later too if you need to improve timing. This variation can avoid the intermediate commit, but you'd still may need to #continueTransaction to see what the manager session updated. 

I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key. A dictionary can even use a key that is a custom object that redefines equality and hash from attributes of what is searched for. Merkle tree hashes might also be used as a way to test if some attribute is contained, but that is a bit advanced to go into. Another advanced item that I once implemented was a custom Dictionary where the key was derived from the value by behavior (it was more efficient because it avoided the cost of Association creation). So many cool tricks, I loved working with GS/S.

Paul Baumann



On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <[hidden email]> wrote:

On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote:
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and
the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in
my case OrbeonFormInstance). The dictionary is very useful because the key
is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class
(like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the
RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2
collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but
indexes are also needed which solution should be implemented ?


Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. 

But...I am sure someone more experienced will come with a better approach!

Cheers, 



regards
bruno



--
View this message in context: http://forum.world.st/Large-collection-and-common-practice-tp4928607.html
Sent from the GLASS mailing list archive at Nabble.com.
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass






Avast logo

El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
www.avast.com




_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list

Paul,

I welcome all your information that you mention is very useful. This application runs with Seaside no GBS with it's replicates.

Time will tell how i will approach this problem but the information is very useful. And with GemStone there are plenty of solutions :)

For now i will keep it simple maybe in the future i will try some sort of your solution !

regards

bruno


El 03/01/2017 a las 20:33, Paul Baumann escribió:
Hi Bruno,

Why you might not want to keep objects forever in an RC collection:

RC collections have more overhead. Many are implemented with session-specific sub-structures that can be modified by the session with little risk of conflict (the rare conflict is when the RC collection itself changes). Consider if you have an RC collection that was populated from 100 separate sessions and then the query that the RC collection would need to do to see if an object/key exists in the RC collection. RC collections are well implemented and reasonably efficient, they just aren't as efficient at some operations (like lookup) as some non-RC collections (otherwise all would be implemented to have RC behavior). You'll find that RC collections have sometimes unexpected growth and shrinkage behavior. Some grow large session-specific subcollections that may never be cleaned up unless there is at least one removal. Some grow in inopportune moments that can affect time-sensitive operations. I'm not saying that you shouldn't use the RC collections for root collections, it depends on your application needs.

Regarding indexing for many attributes:

Sounds like you want to create indexes on a common collection like OrderedCollection that only one session in in charge of updating. I know that GemTalk had improved their indexing implementation several years back, but some kinds of practical issues likely still remain. A field index updates some underlying structure that might also be updated from changes to other objects by other sessions, updates to indexes used to cause many commit conflicts. The more indexes a collection has, the higher the odds of commit conflict. Applications that I've worked on for the past decade or so didn't use collection indexes like you are about to do. An application that used a lot of indexes also had some custom code to save and replay changes to domain objects to compensate for unpredictable commit failures. It is from experiences like that that the queue-manager approach became useful despite all the cross-session coordination.

I'd probably implement a query kind of object that wraps that collection to support collection-specific queries and maintenance operations. The OC (or whatever you use) would normally be private to the query object. The query object could even have special behavior for avoiding commit conflicts (like locking or queueing for example). The query object might for even be clever enough to do a private/internal RC queue when your application code detects conflict is possible (like from use of locks). The queue object would manage the internal RC collection as practical.

You might think of making that query object a subclass of Collection but any GBS users out there should beware that there would be replication bugs (I'd reported the bug with workaround code to GemTalk many years ago). I doubt you'd be doing replication of something like this even if you used GBS, but just saying there was is a bit of strangeness to be discovered at the basic/private/primitive levels and unfortunately it means that caution applies to user-defined subclasses of Collection.

I'm not suggesting you do this, but it is an option. In the time that indexing was not reliable I'd once resorted to creating my own application-specific indexes. That query object that I just mentioned could also have private dictionary instances that can quickly resolve specific keys (attributes of the objects). The query object has the overhead of also maintaining the private attribute-key dictionaries as object are added and removed. I could go into how I implemented these application-defined indexes even without the query object wrapping it, but no need because you have good GemTalk supported indexes now anyway.

I've presented ideas more complicated than you'll need, hopefully an awareness of potential issues and past remedies will save you some effort.

Regards,

Paul Baumann


On Tue, Jan 3, 2017 at 5:04 PM, Smalltalk <[hidden email]> wrote:

Paul,

Thanks for your answer ...

/*  you don't always want to keep the objects in the RC collection

Why you don't always want to keep the objects in the RC collection ?
This is what i'm doing right now :(  - RcKeyValueDictionary

Thanks for the technique you are explaining.

For now i will keep as simple as i can :) may be in the future (next year) i can do something like that but i need to do much more research :)

/* I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key.

This project/system implement a persistence layer (using rest services) for a Java Application (www.orbeon.com) which is used to design, publish, save and query web forms.
(https://github.com/brunobuzzi/OrbeonPersistenceLayer/)

When designing/publishing/sending/saving form --> the ID is mostly used.
Then you have the Summary page. That display all form instances (saved and sent forms) of some form definition.
Here the user can search by a particular field of the defined form.
A search can be by N different fields depending on the form definition (the definition could be a form with 200 nested fields and sections or whatever).

In this case indexes are very useful but in the previous cases a Dictionary is more suitable using the id (that after assigned is immutable)

regards,
bruno


El 03/01/2017 a las 17:38, Paul Baumann escribió:
Hi Bruno,

Multiple sessions can feed an RC collection with reduced commit conflicts, but you don't always want to keep the objects in the RC collection. One common technique is to have a manager session dedicated to moving objects from RC collections into collections that can be accessed more efficiently. Design so that the manager is the only session that will be updating the collections (so that commit conflicts will not happen). The manager session can do polling for new items and you can add gem-to-gem signaling to wake the manager for more timely responses. The challenges with this kind of design are related to update timing between sessions. The process involves a commit to add to the RC collection, an abort for the manager session to see the objects, a commit by the manager to update the root collection (with RC collection removal, RcQueues are usually used BTW), and an abort by the original session if it needs to see the indexed item was added to the root collection. If there is timing sensitivity with this data then you'll likely resort to searching first in your indexeded collection and then also reviewing objects still in the queue waiting for the manager to process them.

A variation of the manager session technique is to send data to the manager session without doing a commit, this might be through communication between gems or by using session-specific file updates that the manager gem reads. Gem-to-gem signaling can be added to this approach later too if you need to improve timing. This variation can avoid the intermediate commit, but you'd still may need to #continueTransaction to see what the manager session updated. 

I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key. A dictionary can even use a key that is a custom object that redefines equality and hash from attributes of what is searched for. Merkle tree hashes might also be used as a way to test if some attribute is contained, but that is a bit advanced to go into. Another advanced item that I once implemented was a custom Dictionary where the key was derived from the value by behavior (it was more efficient because it avoided the cost of Association creation). So many cool tricks, I loved working with GS/S.

Paul Baumann



On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <[hidden email]> wrote:

On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote:
Hi All,

I have a lot RcKeyValueDictionary where the key is the id of the object and
the value is the object itself.
This id once assigned it does NOT change, so far so good :)

The RcKeyValueDictionary is used intensively to add and remove objects (in
my case OrbeonFormInstance). The dictionary is very useful because the key
is always given as parameter.

Also there are searchs by specific inst var of OrbeonFormInstance class
(like username,group, createdTime and so on).

My problem is that i can NOT create an index on aRcKeyValueDictionary.
So which is the commom practice in these cases:
1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
2- Add a new instance variable to the class that holds the
RcKeyValueDictionary and this new variable to be anUnorderedCollection ?

1) This will complicate my direct searchs using the ID.
2) Extra computation when adding and removing objects (now there 2
collections to maintain)

The general question will be something like:
When Dictionaries are very suitable to store large quantity of objects but
indexes are also needed which solution should be implemented ?


Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. 

But...I am sure someone more experienced will come with a better approach!

Cheers, 



regards
bruno



--
View this message in context: http://forum.world.st/Large-collection-and-common-practice-tp4928607.html
Sent from the GLASS mailing list archive at Nabble.com.
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass






Avast logo

El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
www.avast.com







Avast logo

El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
www.avast.com



_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

Richard Sargent
Administrator
In reply to this post by GLASS mailing list
GLASS mailing list wrote
Hi Bruno,

Why you might not want to keep objects forever in an RC collection:

RC collections have more overhead. Many are implemented with
session-specific sub-structures that can be modified by the session with
little risk of conflict (the rare conflict is when the RC collection itself
changes). Consider if you have an RC collection that was populated from 100
separate sessions and then the query that the RC collection would need to
do to see if an object/key exists in the RC collection. RC collections are
well implemented and reasonably efficient, they just aren't as efficient at
some operations (like lookup) as some non-RC collections (otherwise all
would be implemented to have RC behavior). You'll find that RC collections
have sometimes unexpected growth and shrinkage behavior. Some grow large
session-specific subcollections that may never be cleaned up unless there
is at least one removal. Some grow in inopportune moments that can affect
time-sensitive operations. I'm not saying that you shouldn't use the RC
collections for root collections, it depends on your application needs.

Regarding indexing for many attributes:

Sounds like you want to create indexes on a common collection like
OrderedCollection that only one session in in charge of updating. I know
that GemTalk had improved their indexing implementation several years back,
but some kinds of practical issues likely still remain. A field index
updates some underlying structure that might also be updated from changes
to other objects by other sessions, updates to indexes used to cause many
commit conflicts. The more indexes a collection has, the higher the odds of
commit conflict. Applications that I've worked on for the past decade or so
didn't use collection indexes like you are about to do. An application that
used a lot of indexes also had some custom code to save and replay changes
to domain objects to compensate for unpredictable commit failures. It is
from experiences like that that the queue-manager approach became useful
despite all the cross-session coordination.

I'd probably implement a query kind of object that wraps that collection to
support collection-specific queries and maintenance operations. The OC (or
whatever you use) would normally be private to the query object. The query
object could even have special behavior for avoiding commit conflicts (like
locking or queueing for example). The query object might for even be clever
enough to do a private/internal RC queue when your application code detects
conflict is possible (like from use of locks). The queue object would
manage the internal RC collection as practical.

You might think of making that query object a subclass of Collection but
any GBS users out there should beware that there would be replication bugs
(I'd reported the bug with workaround code to GemTalk many years ago).
Hi Paul,

I tried to look up the bug you describe to report what the outcome of it was. However, I couldn't find a GBS bug using the search criteria ( 'subclass', 'collection', 'replication'). If you can think of anything more precise that would be found in the bug report and/or the approximate date range, I can try to find it.

Thanks,
Richard

 I
doubt you'd be doing replication of something like this even if you used
GBS, but just saying there was is a bit of strangeness to be discovered at
the basic/private/primitive levels and unfortunately it means that caution
applies to user-defined subclasses of Collection.

I'm not suggesting you do this, but it is an option. In the time that
indexing was not reliable I'd once resorted to creating my own
application-specific indexes. That query object that I just mentioned could
also have private dictionary instances that can quickly resolve specific
keys (attributes of the objects). The query object has the overhead of also
maintaining the private attribute-key dictionaries as object are added and
removed. I could go into how I implemented these application-defined
indexes even without the query object wrapping it, but no need because you
have good GemTalk supported indexes now anyway.

I've presented ideas more complicated than you'll need, hopefully an
awareness of potential issues and past remedies will save you some effort.

Regards,

Paul Baumann


On Tue, Jan 3, 2017 at 5:04 PM, Smalltalk <[hidden email]> wrote:

> Paul,
>
> Thanks for your answer ...
>
> /*  you don't always want to keep the objects in the RC collection
>
> Why you don't always want to keep the objects in the RC collection ?
> This is what i'm doing right now :(  - RcKeyValueDictionary
>
> Thanks for the technique you are explaining.
>
> For now i will keep as simple as i can :) may be in the future (next year)
> i can do something like that but i need to do much more research :)
>
> /* I wonder what kind of indexing you would need besides ID. If you don't
> need to query for anything other than ID then a dictionary would be fine
> with the ID as key.
> This project/system implement a persistence layer (using rest services)
> for a Java Application (www.orbeon.com) which is used to design, publish,
> save and query web forms.
> (https://github.com/brunobuzzi/OrbeonPersistenceLayer/)
>
> When designing/publishing/sending/saving form --> the ID is mostly used.
> Then you have the Summary page. That display all form instances (saved and
> sent forms) of some form definition.
> Here the user can search by a particular field of the defined form.
> A search can be by N different fields depending on the form definition
> (the definition could be a form with 200 nested fields and sections or
> whatever).
>
> In this case indexes are very useful but in the previous cases a
> Dictionary is more suitable using the id (that after assigned is immutable)
>
> regards,
> bruno
>
>
> El 03/01/2017 a las 17:38, Paul Baumann escribió:
>
> Hi Bruno,
>
> Multiple sessions can feed an RC collection with reduced commit conflicts,
> but you don't always want to keep the objects in the RC collection. One
> common technique is to have a manager session dedicated to moving objects
> from RC collections into collections that can be accessed more efficiently.
> Design so that the manager is the only session that will be updating the
> collections (so that commit conflicts will not happen). The manager session
> can do polling for new items and you can add gem-to-gem signaling to wake
> the manager for more timely responses. The challenges with this kind of
> design are related to update timing between sessions. The process involves
> a commit to add to the RC collection, an abort for the manager session to
> see the objects, a commit by the manager to update the root collection
> (with RC collection removal, RcQueues are usually used BTW), and an abort
> by the original session if it needs to see the indexed item was added to
> the root collection. If there is timing sensitivity with this data then
> you'll likely resort to searching first in your indexeded collection and
> then also reviewing objects still in the queue waiting for the manager to
> process them.
>
> A variation of the manager session technique is to send data to the
> manager session without doing a commit, this might be through communication
> between gems or by using session-specific file updates that the manager gem
> reads. Gem-to-gem signaling can be added to this approach later too if you
> need to improve timing. This variation can avoid the intermediate commit,
> but you'd still may need to #continueTransaction to see what the manager
> session updated.
>
> I wonder what kind of indexing you would need besides ID. If you don't
> need to query for anything other than ID then a dictionary would be fine
> with the ID as key. A dictionary can even use a key that is a custom object
> that redefines equality and hash from attributes of what is searched for.
> Merkle tree hashes might also be used as a way to test if some attribute is
> contained, but that is a bit advanced to go into. Another advanced item
> that I once implemented was a custom Dictionary where the key was derived
> from the value by behavior (it was more efficient because it avoided the
> cost of Association creation). So many cool tricks, I loved working with
> GS/S.
>
> Paul Baumann
>
>
>
> On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <
> [hidden email]> wrote:
>
>>
>> On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <
>> [hidden email]> wrote:
>>
>>> Hi All,
>>>
>>> I have a lot RcKeyValueDictionary where the key is the id of the object
>>> and
>>> the value is the object itself.
>>> This id once assigned it does NOT change, so far so good :)
>>>
>>> The RcKeyValueDictionary is used intensively to add and remove objects
>>> (in
>>> my case OrbeonFormInstance). The dictionary is very useful because the
>>> key
>>> is always given as parameter.
>>>
>>> Also there are searchs by specific inst var of OrbeonFormInstance class
>>> (like username,group, createdTime and so on).
>>>
>>> My problem is that i can NOT create an index on aRcKeyValueDictionary.
>>> So which is the commom practice in these cases:
>>> 1- Change the RcKeyValueDictionary to be an UnorderedCollection ?
>>> 2- Add a new instance variable to the class that holds the
>>> RcKeyValueDictionary and this new variable to be anUnorderedCollection ?
>>>
>>> 1) This will complicate my direct searchs using the ID.
>>> 2) Extra computation when adding and removing objects (now there 2
>>> collections to maintain)
>>>
>>> The general question will be something like:
>>> When Dictionaries are very suitable to store large quantity of objects
>>> but
>>> indexes are also needed which solution should be implemented ?
>>>
>>
>>
>> Assuming you do need or get benefits from the RC flavor (else it brings
>> unnecessary overhead), then quickly analyzing the situation (until GemStone
>> have indexed and rc-flavor Dictionary impl), I think I would use a
>> RcIdentityBag. I would create a identity index for #id , and yes, you will
>> have to modify your code that access the dict, to know detect on the
>> collection using the identity index of ID.
>>
>> But...I am sure someone more experienced will come with a better approach!
>>
>> Cheers,
>>
>>
>>
>>> regards
>>> bruno
>>>
>>>
>>>
>>> --
>>> View this message in context: http://forum.world.st/Large-co
>>> llection-and-common-practice-tp4928607.html
>>> Sent from the GLASS mailing list archive at Nabble.com.
>>> _______________________________________________
>>> Glass mailing list
>>> [hidden email]
>>> http://lists.gemtalksystems.com/mailman/listinfo/glass
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>> _______________________________________________
>> Glass mailing list
>> [hidden email]
>> http://lists.gemtalksystems.com/mailman/listinfo/glass
>>
>>
>
>
>
> ------------------------------
> [image: Avast logo]
> <https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient>
>
> El software de antivirus Avast ha analizado este correo electrónico en
> busca de virus.
> www.avast.com
> <https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient>
>
>

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by GLASS mailing list
Dale,

Is there any measure of from what size a Dictionary could have this
problem ? Maybe it also depends in the length of the key...

(if the problem arise from 1.000.000 entries in a dictionary --> is not
a problem for me right now)

The keys of my Dictionary is an UUID like:
93b1205cec32b42993b9382a9b0d89046fd937c8

At this stage i think i will use 2 Rc collections (one dictionary + one
bag) in the future maybe this approach has to be changed...

For now i will keep it simple !

regards

bruno

El 03/01/2017 a las 17:46, Dale Henrichs escribió:
>
> As the size of the collections gets very large, you have to keep in
> mind that Dictionary-based structures have to be rebuilt periodically
> to keep the collision bucket size manageable and some of the
> dictionaries like RcKeyValueDictionary will rebuild automatically on
> insertion and depending upon the size of the dictionary that could
> lead to long and unpredictable delays for end users ... the btree
> structures used in GemStone indexes do splits and merges on individual
> leaf nodes limiting the cost of insertions ...


---
El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
https://www.avast.com/antivirus

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
Although UUID are used as unique keys, I find them to be rather wasteful, and it takes time to hash, etc. 

In a number of application I have found that pulling a 64bit cryptographically random number, checking for uniqueness before use, and if so using it as a primary 64 bit integer key reduces bloat and just better for most access operations. 

On Wed, Jan 4, 2017 at 6:23 AM, Smalltalk via Glass <[hidden email]> wrote:
Dale,

Is there any measure of from what size a Dictionary could have this problem ? Maybe it also depends in the length of the key...

(if the problem arise from 1.000.000 entries in a dictionary --> is not a problem for me right now)

The keys of my Dictionary is an UUID like: 93b1205cec32b42993b9382a9b0d89046fd937c8

At this stage i think i will use 2 Rc collections (one dictionary + one bag) in the future maybe this approach has to be changed...

For now i will keep it simple !

regards

bruno

El 03/01/2017 a las 17:46, Dale Henrichs escribió:

As the size of the collections gets very large, you have to keep in mind that Dictionary-based structures have to be rebuilt periodically to keep the collision bucket size manageable and some of the dictionaries like RcKeyValueDictionary will rebuild automatically on insertion and depending upon the size of the dictionary that could lead to long and unpredictable delays for end users ... the btree structures used in GemStone indexes do splits and merges on individual leaf nodes limiting the cost of insertions ...


---
El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
https://www.avast.com/antivirus


_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--
===========================================================================
John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk
===========================================================================

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list

John,

The UUID can not be changed because is a data created in a Java application passed to GemStone as a parameter in a service.

regards

bruno


El 04/01/2017 a las 11:29, John McIntosh via Glass escribió:
Although UUID are used as unique keys, I find them to be rather wasteful, and it takes time to hash, etc. 

In a number of application I have found that pulling a 64bit cryptographically random number, checking for uniqueness before use, and if so using it as a primary 64 bit integer key reduces bloat and just better for most access operations. 

On Wed, Jan 4, 2017 at 6:23 AM, Smalltalk via Glass <[hidden email]> wrote:
Dale,

Is there any measure of from what size a Dictionary could have this problem ? Maybe it also depends in the length of the key...

(if the problem arise from 1.000.000 entries in a dictionary --> is not a problem for me right now)

The keys of my Dictionary is an UUID like: 93b1205cec32b42993b9382a9b0d89046fd937c8

At this stage i think i will use 2 Rc collections (one dictionary + one bag) in the future maybe this approach has to be changed...

For now i will keep it simple !

regards

bruno

El 03/01/2017 a las 17:46, Dale Henrichs escribió:

As the size of the collections gets very large, you have to keep in mind that Dictionary-based structures have to be rebuilt periodically to keep the collision bucket size manageable and some of the dictionaries like RcKeyValueDictionary will rebuild automatically on insertion and depending upon the size of the dictionary that could lead to long and unpredictable delays for end users ... the btree structures used in GemStone indexes do splits and merges on individual leaf nodes limiting the cost of insertions ...


---
El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
https://www.avast.com/antivirus


_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass



--
===========================================================================
John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk
===========================================================================


_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass




Avast logo

El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
www.avast.com



_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
In reply to this post by GLASS mailing list
Bruno,

Again, I don't think there is a simple answer:)

For an RcKeyValueDictionary the dictionary is rebuilt when any one
collision bucket is larger than 1/4 of the total size of the dictionary
... so the rebuild frequency is function of the distribution of the hash
values of the keys as well as the number of entries in the dictionary.

As the size of the dictionary grows, the frequency of rebuilds will go
down, but the potential for having some very large collision bucket goes
up (depending upon the randomness of the hash function for the key
class). The RcKeyValueDictionary does a linear search through a
collision bucket, so large collision buckets can lead to slow query
times ...

Another factor is of course the initial size of the dictionary ... if
you start with a dictionary sized to the expected number of elements
(and your hash function gives a good distribution) then you will not
encounter the "rebuild problem"

I think the best approach is to run some scaling experiments yourself
with what you consider to be large collections ... the dictionary
rebuild time is roughly equivalent to the time it takes to build the
dictionary from scratch sizing the initial dictionary. Take a look at
the KeyValueDictionary>>statistics method to see if you are getting a
good distribution of entries and run some queries for the elements in
the largest collision buckets ...

If you are not seeing any unreasonable performance then you are good to
go ...

Dale


On 1/4/17 6:23 AM, Smalltalk wrote:

> Dale,
>
> Is there any measure of from what size a Dictionary could have this
> problem ? Maybe it also depends in the length of the key...
>
> (if the problem arise from 1.000.000 entries in a dictionary --> is
> not a problem for me right now)
>
> The keys of my Dictionary is an UUID like:
> 93b1205cec32b42993b9382a9b0d89046fd937c8
>
> At this stage i think i will use 2 Rc collections (one dictionary +
> one bag) in the future maybe this approach has to be changed...
>
> For now i will keep it simple !
>
> regards
>
> bruno
>
> El 03/01/2017 a las 17:46, Dale Henrichs escribió:
>>
>> As the size of the collections gets very large, you have to keep in
>> mind that Dictionary-based structures have to be rebuilt periodically
>> to keep the collision bucket size manageable and some of the
>> dictionaries like RcKeyValueDictionary will rebuild automatically on
>> insertion and depending upon the size of the dictionary that could
>> lead to long and unpredictable delays for end users ... the btree
>> structures used in GemStone indexes do splits and merges on
>> individual leaf nodes limiting the cost of insertions ...
>
>
> ---
> El software de antivirus Avast ha analizado este correo electrónico en
> busca de virus.
> https://www.avast.com/antivirus
>

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
Reply | Threaded
Open this post in threaded view
|

Re: Large collection and common practice

GLASS mailing list
Dale,

Thanks for the answer.

Well, i will do more research on this scenario and then i will know what
to do :)

thanks again

regards

bruno


El 04/01/2017 a las 12:38, Dale Henrichs escribió:

> Bruno,
>
> Again, I don't think there is a simple answer:)
>
> For an RcKeyValueDictionary the dictionary is rebuilt when any one
> collision bucket is larger than 1/4 of the total size of the
> dictionary ... so the rebuild frequency is function of the
> distribution of the hash values of the keys as well as the number of
> entries in the dictionary.
>
> As the size of the dictionary grows, the frequency of rebuilds will go
> down, but the potential for having some very large collision bucket
> goes up (depending upon the randomness of the hash function for the
> key class). The RcKeyValueDictionary does a linear search through a
> collision bucket, so large collision buckets can lead to slow query
> times ...
>
> Another factor is of course the initial size of the dictionary ... if
> you start with a dictionary sized to the expected number of elements
> (and your hash function gives a good distribution) then you will not
> encounter the "rebuild problem"
>
> I think the best approach is to run some scaling experiments yourself
> with what you consider to be large collections ... the dictionary
> rebuild time is roughly equivalent to the time it takes to build the
> dictionary from scratch sizing the initial dictionary. Take a look at
> the KeyValueDictionary>>statistics method to see if you are getting a
> good distribution of entries and run some queries for the elements in
> the largest collision buckets ...
>
> If you are not seeing any unreasonable performance then you are good
> to go ...
>
> Dale
>
>
> On 1/4/17 6:23 AM, Smalltalk wrote:
>> Dale,
>>
>> Is there any measure of from what size a Dictionary could have this
>> problem ? Maybe it also depends in the length of the key...
>>
>> (if the problem arise from 1.000.000 entries in a dictionary --> is
>> not a problem for me right now)
>>
>> The keys of my Dictionary is an UUID like:
>> 93b1205cec32b42993b9382a9b0d89046fd937c8
>>
>> At this stage i think i will use 2 Rc collections (one dictionary +
>> one bag) in the future maybe this approach has to be changed...
>>
>> For now i will keep it simple !
>>
>> regards
>>
>> bruno
>>
>> El 03/01/2017 a las 17:46, Dale Henrichs escribió:
>>>
>>> As the size of the collections gets very large, you have to keep in
>>> mind that Dictionary-based structures have to be rebuilt
>>> periodically to keep the collision bucket size manageable and some
>>> of the dictionaries like RcKeyValueDictionary will rebuild
>>> automatically on insertion and depending upon the size of the
>>> dictionary that could lead to long and unpredictable delays for end
>>> users ... the btree structures used in GemStone indexes do splits
>>> and merges on individual leaf nodes limiting the cost of insertions ...
>>
>>
>> ---
>> El software de antivirus Avast ha analizado este correo electrónico
>> en busca de virus.
>> https://www.avast.com/antivirus
>>
>
>


---
El software de antivirus Avast ha analizado este correo electrónico en busca de virus.
https://www.avast.com/antivirus

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass