Hi All,
I have a lot RcKeyValueDictionary where the key is the id of the object and the value is the object itself. This id once assigned it does NOT change, so far so good :) The RcKeyValueDictionary is used intensively to add and remove objects (in my case OrbeonFormInstance). The dictionary is very useful because the key is always given as parameter. Also there are searchs by specific inst var of OrbeonFormInstance class (like username,group, createdTime and so on). My problem is that i can NOT create an index on aRcKeyValueDictionary. So which is the commom practice in these cases: 1- Change the RcKeyValueDictionary to be an UnorderedCollection ? 2- Add a new instance variable to the class that holds the RcKeyValueDictionary and this new variable to be anUnorderedCollection ? 1) This will complicate my direct searchs using the ID. 2) Extra computation when adding and removing objects (now there 2 collections to maintain) The general question will be something like: When Dictionaries are very suitable to store large quantity of objects but indexes are also needed which solution should be implemented ? regards bruno |
On Tue, Jan 3, 2017 at 3:14 PM, BrunoBB via Glass <[hidden email]> wrote: Hi All, Assuming you do need or get benefits from the RC flavor (else it brings unnecessary overhead), then quickly analyzing the situation (until GemStone have indexed and rc-flavor Dictionary impl), I think I would use a RcIdentityBag. I would create a identity index for #id , and yes, you will have to modify your code that access the dict, to know detect on the collection using the identity index of ID. But...I am sure someone more experienced will come with a better approach! Cheers,
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Hi Bruno, Multiple sessions can feed an RC collection with reduced commit conflicts, but you don't always want to keep the objects in the RC collection. One common technique is to have a manager session dedicated to moving objects from RC collections into collections that can be accessed more efficiently. Design so that the manager is the only session that will be updating the collections (so that commit conflicts will not happen). The manager session can do polling for new items and you can add gem-to-gem signaling to wake the manager for more timely responses. The challenges with this kind of design are related to update timing between sessions. The process involves a commit to add to the RC collection, an abort for the manager session to see the objects, a commit by the manager to update the root collection (with RC collection removal, RcQueues are usually used BTW), and an abort by the original session if it needs to see the indexed item was added to the root collection. If there is timing sensitivity with this data then you'll likely resort to searching first in your indexeded collection and then also reviewing objects still in the queue waiting for the manager to process them. A variation of the manager session technique is to send data to the manager session without doing a commit, this might be through communication between gems or by using session-specific file updates that the manager gem reads. Gem-to-gem signaling can be added to this approach later too if you need to improve timing. This variation can avoid the intermediate commit, but you'd still may need to #continueTransaction to see what the manager session updated. I wonder what kind of indexing you would need besides ID. If you don't need to query for anything other than ID then a dictionary would be fine with the ID as key. A dictionary can even use a key that is a custom object that redefines equality and hash from attributes of what is searched for. Merkle tree hashes might also be used as a way to test if some attribute is contained, but that is a bit advanced to go into. Another advanced item that I once implemented was a custom Dictionary where the key was derived from the value by behavior (it was more efficient because it avoided the cost of Association creation). So many cool tricks, I loved working with GS/S. Paul Baumann On Tue, Jan 3, 2017 at 2:39 PM, Mariano Martinez Peck via Glass <[hidden email]> wrote:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by BrunoBB
On 01/03/2017 10:14 AM, BrunoBB via Glass wrote: > Hi All, > > I have a lot RcKeyValueDictionary where the key is the id of the object and > the value is the object itself. > This id once assigned it does NOT change, so far so good :) > > The RcKeyValueDictionary is used intensively to add and remove objects (in > my case OrbeonFormInstance). The dictionary is very useful because the key > is always given as parameter. > > Also there are searchs by specific inst var of OrbeonFormInstance class > (like username,group, createdTime and so on). > > My problem is that i can NOT create an index on aRcKeyValueDictionary. > So which is the commom practice in these cases: > 1- Change the RcKeyValueDictionary to be an UnorderedCollection ? > 2- Add a new instance variable to the class that holds the > RcKeyValueDictionary and this new variable to be anUnorderedCollection ? > > 1) This will complicate my direct searchs using the ID. > 2) Extra computation when adding and removing objects (now there 2 > collections to maintain) > > The general question will be something like: > When Dictionaries are very suitable to store large quantity of objects but > indexes are also needed which solution should be implemented ? > The general answer to your general question is that if you start out using a dictionary for lookups of a single field in an object and then get to the point where you are interested doing queries against multiple fields in your object _REPLACING_ your dictionary with an indexed collection starts to make sense. You can create identity indexes on the fields that are identity-based like your id field or group (assuming groups are identified by a Symbol or specific instances of a set of group objects) and use equality indexes for fields where you cannot use identity (username) where you are interested in doing queries that involve ranges of values (createdTime). If the indexed collection is subject to concurrent additions/deletions, then you should use an RcIdentityBag. If the objects themselves are subject to concurrent updates to indexed fields, then you can create indexes using the `reducedConflict` option. To do identity-based lookups you would not quite have the convenience of using `dictionary at: id` as you would need to create a GsQuery that in it's simplest form would look like (assuming indexedCollection and id set appropriately): ('each.id == id' asQueryOn: indexedCollection) bind: 'id' to: id queryResult. The inconvenience of the query would be offset by the fact that you would still have only one collection to maintain (the RcIdentityBag) and using indexes means that if any or your fields are changed, the indexes are automatically updated ... You can cache a GsQuery instance to avoid the overhead of parsing the query on every invocation and you can use a Smalltalk API for creating a GsQuery to avoid the complication of creating a string representation of queries. If you are looking for maximum query and update performance, you might find that custom collections or object structures might well perform better than using indexes, but the speed advantages have to be offset by the complexity of maintaining these custom collections. Custom collections or object structures come into play if you know ahead of time the types of queries that you want to run ... As the size of the collections gets very large, you have to keep in mind that Dictionary-based structures have to be rebuilt periodically to keep the collision bucket size manageable and some of the dictionaries like RcKeyValueDictionary will rebuild automatically on insertion and depending upon the size of the dictionary that could lead to long and unpredictable delays for end users ... the btree structures used in GemStone indexes do splits and merges on individual leaf nodes limiting the cost of insertions ... I am afraid that there is no simple answer ... Dale _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by GLASS mailing list
Paul, Why you don't always want to keep the objects in the RC
collection ? For now i will keep as simple as i can :) may be in the future
(next year) i can do something like that but i need to do much
more research :) /* I wonder what kind of indexing you would need besides ID. If
you don't need to query for anything other than ID then a
dictionary would be fine with the ID as key. (https://github.com/brunobuzzi/OrbeonPersistenceLayer/) When designing/publishing/sending/saving form --> the ID is mostly used. Then you have the Summary page. That display all form instances (saved and sent forms) of some form definition. Here the user can search by a particular field of the defined form. A search can be by N different fields depending on the form definition (the definition could be a form with 200 nested fields and sections or whatever). In this case indexes are very useful but in the previous cases a Dictionary is more suitable using the id (that after assigned is immutable) regards, bruno El 03/01/2017 a las 17:38, Paul Baumann
escribió:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by GLASS mailing list
Dale,
Thanks for the detailed answer :) I think replacing the Dictionary with Rc collection will be the less complex option. Although adding an additional inst var (an rc collection) it won't be that complex. Since to reach 200.000 forms or more it takes time then i will implement more complex in the near future :) I will analyze the problem a little more before selecting another approach. Thanks to all for your answers ! regards, bruno El 03/01/2017 a las 17:46, Dale Henrichs escribió: > > > On 01/03/2017 10:14 AM, BrunoBB via Glass wrote: >> Hi All, >> >> I have a lot RcKeyValueDictionary where the key is the id of the >> object and >> the value is the object itself. >> This id once assigned it does NOT change, so far so good :) >> >> The RcKeyValueDictionary is used intensively to add and remove >> objects (in >> my case OrbeonFormInstance). The dictionary is very useful because >> the key >> is always given as parameter. >> >> Also there are searchs by specific inst var of OrbeonFormInstance class >> (like username,group, createdTime and so on). >> >> My problem is that i can NOT create an index on aRcKeyValueDictionary. >> So which is the commom practice in these cases: >> 1- Change the RcKeyValueDictionary to be an UnorderedCollection ? >> 2- Add a new instance variable to the class that holds the >> RcKeyValueDictionary and this new variable to be anUnorderedCollection ? >> >> 1) This will complicate my direct searchs using the ID. >> 2) Extra computation when adding and removing objects (now there 2 >> collections to maintain) >> >> The general question will be something like: >> When Dictionaries are very suitable to store large quantity of >> objects but >> indexes are also needed which solution should be implemented ? >> > Bruno, > > The general answer to your general question is that if you start out > using a dictionary for lookups of a single field in an object and then > get to the point where you are interested doing queries against > multiple fields in your object _REPLACING_ your dictionary with an > indexed collection starts to make sense. > > You can create identity indexes on the fields that are identity-based > like your id field or group (assuming groups are identified by a > Symbol or specific instances of a set of group objects) and use > equality indexes for fields where you cannot use identity (username) > where you are interested in doing queries that involve ranges of > values (createdTime). > > If the indexed collection is subject to concurrent > additions/deletions, then you should use an RcIdentityBag. If the > objects themselves are subject to concurrent updates to indexed > fields, then you can create indexes using the `reducedConflict` option. > > To do identity-based lookups you would not quite have the convenience > of using `dictionary at: id` as you would need to create a GsQuery > that in it's simplest form would look like (assuming indexedCollection > and id set appropriately): > > ('each.id == id' asQueryOn: indexedCollection) > bind: 'id' to: id > queryResult. > > The inconvenience of the query would be offset by the fact that you > would still have only one collection to maintain (the RcIdentityBag) > and using indexes means that if any or your fields are changed, the > indexes are automatically updated ... > > You can cache a GsQuery instance to avoid the overhead of parsing the > query on every invocation and you can use a Smalltalk API for creating > a GsQuery to avoid the complication of creating a string > representation of queries. > > If you are looking for maximum query and update performance, you might > find that custom collections or object structures might well perform > better than using indexes, but the speed advantages have to be offset > by the complexity of maintaining these custom collections. Custom > collections or object structures come into play if you know ahead of > time the types of queries that you want to run ... > > As the size of the collections gets very large, you have to keep in > mind that Dictionary-based structures have to be rebuilt periodically > to keep the collision bucket size manageable and some of the > dictionaries like RcKeyValueDictionary will rebuild automatically on > insertion and depending upon the size of the dictionary that could > lead to long and unpredictable delays for end users ... the btree > structures used in GemStone indexes do splits and merges on individual > leaf nodes limiting the cost of insertions ... > > I am afraid that there is no simple answer ... > > Dale > --- El software de antivirus Avast ha analizado este correo electrónico en busca de virus. https://www.avast.com/antivirus _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by GLASS mailing list
Hi Bruno, Why you might not want to keep objects forever in an RC collection: Regarding indexing for many attributes: Sounds like you want to create indexes on a common collection like OrderedCollection that only one session in in charge of updating. I know that GemTalk had improved their indexing implementation several years back, but some kinds of practical issues likely still remain. A field index updates some underlying structure that might also be updated from changes to other objects by other sessions, updates to indexes used to cause many commit conflicts. The more indexes a collection has, the higher the odds of commit conflict. Applications that I've worked on for the past decade or so didn't use collection indexes like you are about to do. An application that used a lot of indexes also had some custom code to save and replay changes to domain objects to compensate for unpredictable commit failures. It is from experiences like that that the queue-manager approach became useful despite all the cross-session coordination. I'd probably implement a query kind of object that wraps that collection to support collection-specific queries and maintenance operations. The OC (or whatever you use) would normally be private to the query object. The query object could even have special behavior for avoiding commit conflicts (like locking or queueing for example). The query object might for even be clever enough to do a private/internal RC queue when your application code detects conflict is possible (like from use of locks). The queue object would manage the internal RC collection as practical. You might think of making that query object a subclass of Collection but any GBS users out there should beware that there would be replication bugs (I'd reported the bug with workaround code to GemTalk many years ago). I doubt you'd be doing replication of something like this even if you used GBS, but just saying there was is a bit of strangeness to be discovered at the basic/private/primitive levels and unfortunately it means that caution applies to user-defined subclasses of Collection. I'm not suggesting you do this, but it is an option. In the time that indexing was not reliable I'd once resorted to creating my own application-specific indexes. That query object that I just mentioned could also have private dictionary instances that can quickly resolve specific keys (attributes of the objects). The query object has the overhead of also maintaining the private attribute-key dictionaries as object are added and removed. I could go into how I implemented these application-defined indexes even without the query object wrapping it, but no need because you have good GemTalk supported indexes now anyway. I've presented ideas more complicated than you'll need, hopefully an awareness of potential issues and past remedies will save you some effort. Regards, Paul Baumann On Tue, Jan 3, 2017 at 5:04 PM, Smalltalk <[hidden email]> wrote:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Paul, I welcome all your information that you mention is very useful. This application runs with Seaside no GBS with it's replicates. Time will tell how i will approach this problem but the information is very useful. And with GemStone there are plenty of solutions :) For now i will keep it simple maybe in the future i will try some
sort of your solution ! regards bruno El 03/01/2017 a las 20:33, Paul Baumann
escribió:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Administrator
|
In reply to this post by GLASS mailing list
Hi Paul, I tried to look up the bug you describe to report what the outcome of it was. However, I couldn't find a GBS bug using the search criteria ( 'subclass', 'collection', 'replication'). If you can think of anything more precise that would be found in the bug report and/or the approximate date range, I can try to find it. Thanks, Richard
|
In reply to this post by GLASS mailing list
Dale,
Is there any measure of from what size a Dictionary could have this problem ? Maybe it also depends in the length of the key... (if the problem arise from 1.000.000 entries in a dictionary --> is not a problem for me right now) The keys of my Dictionary is an UUID like: 93b1205cec32b42993b9382a9b0d89046fd937c8 At this stage i think i will use 2 Rc collections (one dictionary + one bag) in the future maybe this approach has to be changed... For now i will keep it simple ! regards bruno El 03/01/2017 a las 17:46, Dale Henrichs escribió: > > As the size of the collections gets very large, you have to keep in > mind that Dictionary-based structures have to be rebuilt periodically > to keep the collision bucket size manageable and some of the > dictionaries like RcKeyValueDictionary will rebuild automatically on > insertion and depending upon the size of the dictionary that could > lead to long and unpredictable delays for end users ... the btree > structures used in GemStone indexes do splits and merges on individual > leaf nodes limiting the cost of insertions ... --- El software de antivirus Avast ha analizado este correo electrónico en busca de virus. https://www.avast.com/antivirus _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Although UUID are used as unique keys, I find them to be rather wasteful, and it takes time to hash, etc. In a number of application I have found that pulling a 64bit cryptographically random number, checking for uniqueness before use, and if so using it as a primary 64 bit integer key reduces bloat and just better for most access operations. On Wed, Jan 4, 2017 at 6:23 AM, Smalltalk via Glass <[hidden email]> wrote: Dale, =========================================================================== John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk =========================================================================== _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
John, The UUID can not be changed because is a data created in a Java application passed to GemStone as a parameter in a service. regards bruno El 04/01/2017 a las 11:29, John
McIntosh via Glass escribió:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by GLASS mailing list
Bruno,
Again, I don't think there is a simple answer:) For an RcKeyValueDictionary the dictionary is rebuilt when any one collision bucket is larger than 1/4 of the total size of the dictionary ... so the rebuild frequency is function of the distribution of the hash values of the keys as well as the number of entries in the dictionary. As the size of the dictionary grows, the frequency of rebuilds will go down, but the potential for having some very large collision bucket goes up (depending upon the randomness of the hash function for the key class). The RcKeyValueDictionary does a linear search through a collision bucket, so large collision buckets can lead to slow query times ... Another factor is of course the initial size of the dictionary ... if you start with a dictionary sized to the expected number of elements (and your hash function gives a good distribution) then you will not encounter the "rebuild problem" I think the best approach is to run some scaling experiments yourself with what you consider to be large collections ... the dictionary rebuild time is roughly equivalent to the time it takes to build the dictionary from scratch sizing the initial dictionary. Take a look at the KeyValueDictionary>>statistics method to see if you are getting a good distribution of entries and run some queries for the elements in the largest collision buckets ... If you are not seeing any unreasonable performance then you are good to go ... Dale On 1/4/17 6:23 AM, Smalltalk wrote: > Dale, > > Is there any measure of from what size a Dictionary could have this > problem ? Maybe it also depends in the length of the key... > > (if the problem arise from 1.000.000 entries in a dictionary --> is > not a problem for me right now) > > The keys of my Dictionary is an UUID like: > 93b1205cec32b42993b9382a9b0d89046fd937c8 > > At this stage i think i will use 2 Rc collections (one dictionary + > one bag) in the future maybe this approach has to be changed... > > For now i will keep it simple ! > > regards > > bruno > > El 03/01/2017 a las 17:46, Dale Henrichs escribió: >> >> As the size of the collections gets very large, you have to keep in >> mind that Dictionary-based structures have to be rebuilt periodically >> to keep the collision bucket size manageable and some of the >> dictionaries like RcKeyValueDictionary will rebuild automatically on >> insertion and depending upon the size of the dictionary that could >> lead to long and unpredictable delays for end users ... the btree >> structures used in GemStone indexes do splits and merges on >> individual leaf nodes limiting the cost of insertions ... > > > --- > El software de antivirus Avast ha analizado este correo electrónico en > busca de virus. > https://www.avast.com/antivirus > _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Dale,
Thanks for the answer. Well, i will do more research on this scenario and then i will know what to do :) thanks again regards bruno El 04/01/2017 a las 12:38, Dale Henrichs escribió: > Bruno, > > Again, I don't think there is a simple answer:) > > For an RcKeyValueDictionary the dictionary is rebuilt when any one > collision bucket is larger than 1/4 of the total size of the > dictionary ... so the rebuild frequency is function of the > distribution of the hash values of the keys as well as the number of > entries in the dictionary. > > As the size of the dictionary grows, the frequency of rebuilds will go > down, but the potential for having some very large collision bucket > goes up (depending upon the randomness of the hash function for the > key class). The RcKeyValueDictionary does a linear search through a > collision bucket, so large collision buckets can lead to slow query > times ... > > Another factor is of course the initial size of the dictionary ... if > you start with a dictionary sized to the expected number of elements > (and your hash function gives a good distribution) then you will not > encounter the "rebuild problem" > > I think the best approach is to run some scaling experiments yourself > with what you consider to be large collections ... the dictionary > rebuild time is roughly equivalent to the time it takes to build the > dictionary from scratch sizing the initial dictionary. Take a look at > the KeyValueDictionary>>statistics method to see if you are getting a > good distribution of entries and run some queries for the elements in > the largest collision buckets ... > > If you are not seeing any unreasonable performance then you are good > to go ... > > Dale > > > On 1/4/17 6:23 AM, Smalltalk wrote: >> Dale, >> >> Is there any measure of from what size a Dictionary could have this >> problem ? Maybe it also depends in the length of the key... >> >> (if the problem arise from 1.000.000 entries in a dictionary --> is >> not a problem for me right now) >> >> The keys of my Dictionary is an UUID like: >> 93b1205cec32b42993b9382a9b0d89046fd937c8 >> >> At this stage i think i will use 2 Rc collections (one dictionary + >> one bag) in the future maybe this approach has to be changed... >> >> For now i will keep it simple ! >> >> regards >> >> bruno >> >> El 03/01/2017 a las 17:46, Dale Henrichs escribió: >>> >>> As the size of the collections gets very large, you have to keep in >>> mind that Dictionary-based structures have to be rebuilt >>> periodically to keep the collision bucket size manageable and some >>> of the dictionaries like RcKeyValueDictionary will rebuild >>> automatically on insertion and depending upon the size of the >>> dictionary that could lead to long and unpredictable delays for end >>> users ... the btree structures used in GemStone indexes do splits >>> and merges on individual leaf nodes limiting the cost of insertions ... >> >> >> --- >> El software de antivirus Avast ha analizado este correo electrónico >> en busca de virus. >> https://www.avast.com/antivirus >> > > --- El software de antivirus Avast ha analizado este correo electrónico en busca de virus. https://www.avast.com/antivirus _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Free forum by Nabble | Edit this page |