Hi guys,
I am storing a lot of data internally in GemStone that I get from a third-party lib. Right now I have 1MM objects, but there will likely be a few more MM soon. These objects are kind of "rows" from huge files I receive from this lib, as if it were a "database". Anyway, these objects have a string code. These string codes WON'T fit in 61 bits or so, so they will NOT be immediate objects. However, each code is repeated on average 20 times. So in 1MM rows I could be storing 1MM strings or 50,000 symbols. I expect this imported data to keep growing every day. In fact, I am not even 100% sure the best solution is to store this inside GemStone, but that's a story for another day.
My question is: if I use symbols, I will (I guess) save a lot of space, greatly reducing the number of objects and likely the number of objects needed in memory (hence I hope I will need less memory/SPC). My only worry is symbol lookup performance. I don't know how the "Symbol table" is implemented in GemStone. From what I understand, when I CREATE a symbol I pay for the lookup in the table, but after that my object reference points directly to the symbol and not to the entry in the symbol table, so I don't pay an extra indirection each time I access my symbol instance, right? However, if I end up with a very large table of symbols, I am afraid that every single #asSymbol I do in my app (from any other use case, fully decoupled from this one) would be slowed down.
I cannot find the AllSymbols dictionary, and the deepest I could get was #_existingWithAll:. I tried some benchmarks, and #asSymbol still seems fast even with a much larger SymbolTable (or whatever the GemStone equivalent is). This may be a tradeoff, but my gut feeling tells me that storing these as symbols is worth it. Do you have any suggestion or recommendation?
Thanks in advance, _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
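The canonicalization idea in the question can be sketched in a few lines. This is a Python illustration only, not GemStone code: the `canonicalize` and `load_rows` helpers and the code format are made up for the example, and a plain dict stands in for whatever table the system actually uses. The point it shows is that the table lookup is paid once per row at load time, after which each row holds a direct reference to the single shared instance.

```python
def canonicalize(table, code):
    """Return the one shared instance equal to `code`, registering it on first sight."""
    return table.setdefault(code, code)


def load_rows(raw_codes):
    """Replace every incoming code with its canonical instance."""
    table = {}
    rows = [canonicalize(table, code) for code in raw_codes]
    return rows, table


# Scaled-down version of the thread's numbers: 1,000 rows, each code repeated 20 times.
raw = [f"CODE-{i % 50:08d}-TOO-LONG-TO-BE-IMMEDIATE" for i in range(1000)]
rows, table = load_rows(raw)

# Number of distinct string objects referenced by the rows after canonicalization.
distinct_after = len({id(code) for code in rows})
```

With the 20x repetition described above, the rows end up referencing one object per distinct code instead of one object per row.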
Mariano,
I think that you are headed in the right direction. Canonicalizing the unique id is the right approach, and we do a pretty good job with Symbols. Talking with the engineers here, you can expect pretty good performance from Symbols up to around 10 million Symbols; beyond that you might want to take a different approach. That approach would involve using a StringKeyValueDictionary and bumping up the number of collision buckets beyond that used for Symbols (the Symbol table has a fixed number of collision buckets). The downside of managing a StringKeyValueDictionary is that you'd have to worry about conflicts when multiple gems encounter the same id.
You don't see the Symbol table because the AllSymbols dictionary is managed by a separate gem to provide conflict-free canonicalization of Symbols. The AllSymbols dictionary is in the SPC, and there are optimizations that let us add new Symbols very efficiently without worrying about conflicts.
In 3.3 we will be providing a canonicalization framework that would provide support for doing your own conflict canonicalization, at which time you would switch to the StringKeyValueDictionary approach.
Dale
On 03/31/2015 08:32 AM, Mariano Martinez Peck via Glass wrote:
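The remark about a fixed number of collision buckets can be illustrated with a toy model. The sketch below is Python, not GemStone code, and it is an assumption-laden simplification: the thread does not describe how AllSymbols is laid out, so the `FixedBucketTable` class, the bucket count of 64, and the CRC32 hash are all hypothetical. It only shows the general effect that, with a fixed bucket count, the average chain a lookup has to scan grows in proportion to the number of stored keys.

```python
import zlib


class FixedBucketTable:
    """Toy chained hash table whose bucket count never grows (hypothetical model)."""

    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def intern(self, key):
        """Return (canonical_key, number_of_comparisons_made)."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.buckets)
        chain = self.buckets[index]
        for comparisons, existing in enumerate(chain, start=1):
            if existing == key:
                return existing, comparisons
        chain.append(key)
        # A miss compared the key against every entry already in the chain.
        return key, len(chain) - 1

    def mean_chain_length(self):
        return sum(len(chain) for chain in self.buckets) / len(self.buckets)


def fill(n_keys, n_buckets=64):
    """Build a table holding n_keys distinct keys."""
    table = FixedBucketTable(n_buckets)
    for i in range(n_keys):
        table.intern(f"code-{i}")
    return table
```

Comparing `fill(640).mean_chain_length()` with `fill(6400).mean_chain_length()` shows the chain length scaling with the key count, which is the reason a table with more buckets becomes attractive at very large sizes.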
On Tue, Mar 31, 2015 at 2:20 PM, Dale Henrichs via Glass <[hidden email]> wrote:
Excellent! Yes, indeed, I got the same impression that the symbol lookup was really fast. Thanks for checking out with the engineers too.
Ok, good to know. So I think I will go with the symbols approach and check every once in a while how many symbol instances I have.
Ok, thanks for the explanation.
That would be nice, because I would like to take a similar approach with dates. I have TONS of equal dates (spread across many different collections), and I would not mind managing them by identity (#==). BTW, I thought a regular Date would fit as an immediate object, but after re-watching James's video about immediate objects, that doesn't seem to be the case. Wouldn't a SmallDate that is immutable and immediate be interesting (if necessary)? Could this be useful? Most financial apps have tons of dates.
On 03/31/2015 10:57 AM, Mariano Martinez Peck wrote:
Yes, SmallDate was one of the ideas that was thrown around, but then the canonicalization framework approach was chosen because it could handle a broader range of classes, including Dates.
Dale
In reply to this post by GLASS mailing list
Could you MD5 the strings and store the MD5s in the table 'rows', then have a lookup dictionary somewhere that lets you go from the MD5 to the original string?
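A rough Python sketch of this MD5 suggestion, for illustration only (the `digest_of`, `build`, and `original` helpers are made-up names, and nothing here is GemStone API): each row stores the digest of its code, and a side dictionary maps each digest back to the original string.

```python
import hashlib


def digest_of(code):
    """16-byte MD5 digest of the code string."""
    return hashlib.md5(code.encode("utf-8")).digest()


def build(codes):
    """Return (rows holding digests, lookup from digest to original string)."""
    lookup = {}
    rows = []
    for code in codes:
        key = digest_of(code)
        lookup.setdefault(key, code)  # keep the first string seen for this digest
        rows.append(key)
    return rows, lookup


def original(lookup, key):
    """Recover the original string for a stored digest."""
    return lookup[key]
```

One thing to keep in mind with this design: an MD5 digest is 128 bits, so it is larger than the roughly 61 bits mentioned in the original question and would presumably still be stored as a separate object per row unless the digests are canonicalized as well.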
In reply to this post by GLASS mailing list
On Tue, Mar 31, 2015 at 3:08 PM, Dale Henrichs <[hidden email]> wrote:
Ok, yeah, that makes sense! Uffff, I am eager to read a blog post written by you about the canonicalization framework, taking Dates as an example :) Ok, very cool. Thanks Dale.
Hi guys,
I must admit I find it strange to have Symbol GC disabled by default (STN_SYMBOL_GC_ENABLED: false), even more so when the user guide says: "If enabled, symbol garbage collection is performed automatically in the background and requires no management". In addition, in Pharo the SymbolTable is weak, so we assume unreferenced symbols will be GCed automatically. So, is there any reason I am not seeing why this is disabled by default?
Thanks,
On Tue, Mar 31, 2015 at 3:12 PM, Mariano Martinez Peck <[hidden email]> wrote:
On Tue, Mar 31, 2015 at 7:11 PM, Mariano Martinez Peck <[hidden email]> wrote:
Ok. I guess the reason is: most normal apps do not create lots of symbols, and most symbols in the system are used for class names, selectors, categories, protocols, etc., which do not change much (you almost never unload a package). So if the table is quite large, I would imagine that in a "normal app" the GC might get slower while likely collecting only a few symbols.
Mariano,
I think the reason that we've disabled it by default is that we don't like to surprise legacy applications; changing behavior is always a dicey business. A few of our customers have expressed a desire to be able to gc symbols, so they are motivated to enable it. The ones that haven't expressed an interest either don't care or are dependent upon the current behavior :)
I don't think that gc performance is specifically the issue, although it does fall into the area where a legacy app may be sensitive to mfc times and will notice the change :)
You make a good point, so it might make sense to enable symbol gc by default for GsDevKit. I submitted an Issue[1] for this.
Dale
[1] https://github.com/GsDevKit/gsDevKitHome/issues/72
On 04/01/2015 06:24 AM, Mariano Martinez Peck wrote:
On Wed, Apr 1, 2015 at 1:48 PM, Dale Henrichs <[hidden email]> wrote:
OK. Maybe it would be nice to add something to avoid this. Imagine I know that on average I do not release many symbols, yet I want the app to be down (GC time) as little as possible. So say I want to GC symbols only once a week while I run MFC daily. Is there a way I can trigger the GC of symbols without having to do the following?
1) Stop the stone
2) Modify system.conf to set STN_SYMBOL_GC_ENABLED=TRUE
3) Start the stone
4) Do all the GC stuff (markForCollection seems to need to be done TWICE to really free the symbols; this is explained in the docs)
5) Set STN_SYMBOL_GC_ENABLED=FALSE
I don't want to be changing configuration files. This is not urgent at all, since I CAN actually GC symbols every day; I just thought it could be useful. Ohhh... maybe in my code I can check whether it is Sunday (or whatever) and then set #StnSymbolGcEnabled to true via code??
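The "check whether it is Sunday" idea can be sketched as a small scheduling decision. This is a Python illustration under stated assumptions, not GemStone code: the step names are plain descriptive strings rather than real calls, `maintenance_plan` is a hypothetical helper, and the two markForCollection passes on the symbol GC day simply follow the description in the post above.

```python
from datetime import date

SUNDAY = 6  # date.weekday() numbers Monday as 0 and Sunday as 6


def maintenance_plan(today, symbol_gc_day=SUNDAY):
    """Return the ordered list of maintenance steps for the given day."""
    collect_symbols = today.weekday() == symbol_gc_day
    steps = []
    if collect_symbols:
        steps.append("enable symbol GC")
    steps.append("markForCollection")  # the daily MFC run
    if collect_symbols:
        steps.append("markForCollection")  # second pass, per the post above
        steps.append("disable symbol GC")
    return steps
```

Calling `maintenance_plan(date.today())` from the daily job would then yield the plain MFC run on most days and the longer symbol-collecting sequence once a week.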
Ok, good to know.
Great, thanks!
On 04/03/2015 08:52 AM, Mariano Martinez Peck wrote:
Yes, STN_SYMBOL_GC_ENABLED[1] can be changed at runtime[2]. DataCurator has GarbageCollection privileges, so you can make the change pretty easily from Smalltalk if you want.
Unless you find that you need to clean up symbols right now, you could let nature take its course: if you are only gcing symbols once a week, you can probably afford to wait a week for the symbol gc to be finalized.
Dale
[1] http://downloads.gemtalksystems.com/docs/GemStone64/3.2.x/GS64-SysAdmin-3.2/A-ConfigOptions.htm#pgfId-493777
[2] http://downloads.gemtalksystems.com/docs/GemStone64/3.2.x/GS64-SysAdmin-3.2/1-Server.htm#pgfId-103257