Fw: Canonicalization + 2 Spaces

Fw: Canonicalization + 2 Spaces

Gemstone/S mailing list
Greetings fellow ghosts of GemStone past. It's been nice seeing some of your names in my inbox again after all these years. 

SmallDateAndTime sounds like a great idea. Back in 2004 I posted on a predecessor to this list about canonicalizing dates and curve descriptors. I don't think there's an archive available anywhere for those older posts, so I'm including it again below in case anyone's interested.

BTW, the work described was on JPMorgan's Kapital system. See slide 8 of an ESUG talk given a few months after the post below - http://www.esug.org/data/ESUG2004/ValueOfSmalltalk.pdf

-Keith


----- Forwarded Message -----
From: Keith Piraino <[hidden email]>
Sent: Monday, January 12, 2004, 05:52:22 PM EST
Subject: Canonicalization + 2 Spaces

I’ve been working on canonicalizing some objects recently in a
GemStone/VisualWorks system. I haven’t seen discussion in the past
about some of the 2-space issues that come up in this context when
you’re dealing with both GemStone and a client image. I’ll describe the
work we’ve done, and I’d be interested in comments from anyone about
how they’ve tackled similar issues…

The first phase of this work involved dates. We ran a scan and found
that we had 15 million date instances in one of our databases, but they
really only represented 17,000 different days. The few hundred MB
wasted in our (much larger) databases isn’t great, but our bigger
concern was the tens of MB of memory these duplicate instances took up
in images when we faulted them in. The canonicalization on each side
was simple enough. The range of dates we’re interested in is 200 years,
amounting to about 70K instances. In GS we pre-build an array of the
canonical instances, and the number of days since January 1, 1901 is the
index into the array. In VW we have a similar array that is lazily
populated as needed.
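
A minimal sketch of the VW side looks like this (the class variable and
selector names are illustrative, not our actual code; #fromDays: and
#asDays count from January 1, 1901 in this dialect):

    Date class >> canonicalFromDays: dayCount
        "Answer the single shared instance for dayCount days since
         January 1, 1901, lazily creating and caching it.
         CanonicalDates is a class variable; the index is 1-based."
        | index |
        CanonicalDates isNil ifTrue: [CanonicalDates := Array new: 73050].  "~200 years of days"
        index := dayCount + 1.
        (CanonicalDates at: index) isNil
            ifTrue: [CanonicalDates at: index put: (self fromDays: dayCount)].
        ^CanonicalDates at: index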

The tricky part is the mapping between the two and supporting
“independent creation” in VW. We don’t want to have to fault in all 70K
dates up front or worse have our VW date creation code forwarding into
the gem at arbitrary points to find the right instance. Tests faulting
all the dates added 30 seconds to our login time, which is definitely
not desirable. Instead we override the faulting and flushing behavior
on dates. We override #newFromGSObjectReport: and parse the report to
get the offset into the canonical array. If a corresponding VW date has
already been created we map to that instead of the instance in the
report. If it’s a new instance to that image we just ensure it ends up
in the local canonical array.
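
Schematically the faulting override is no more than the following;
#dayOffsetFromReport: is a stand-in for our report parsing, which
depends on GBS internals of that era:

    Date class >> newFromGSObjectReport: aReport
        "Answer the canonical VW date for the day offset encoded in the
         report instead of building a fresh replicate. The canonical
         array lookup handles both the already-created and the new case."
        | dayCount |
        dayCount := self dayOffsetFromReport: aReport.
        ^self canonicalFromDays: dayCount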

We hook into flushing by using #asGSObjectInSession:. During the first
flush we use #privatePerform: to retrieve the encoded oops of all 70K
canonical dates. As individual dates are flushed we can then create the
appropriate GbsObjects and map them. This way even if the date instance
is created locally in VW it will always end up resolving to the single
corresponding canonical instance in GS. Faulting 70K encoded oops only
takes about a second since they’re SmallIntegers. We process the report
ourselves to avoid intermediate GbsObjects which speeds things up a
little more.
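
On the flushing side the shape is roughly this; both helpers are
hypothetical wrappers around the #privatePerform: bulk fetch and the
GbsObject creation and mapping:

    Date >> asGSObjectInSession: aSession
        "Always resolve to the one canonical persistent date, even if
         this instance was created locally in VW. #canonicalOopsIn: runs
         the #privatePerform: fetch once per session and caches the 70K
         encoded oops; #gbsObjectForOop:in: builds and maps the GbsObject."
        | oops |
        oops := Date canonicalOopsIn: aSession.
        ^Date gbsObjectForOop: (oops at: self asDays + 1) in: aSession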

The next phase of this work involved objects that function as
multi-part keys in our application. They hold more complex data but are
always uniquely identified by a name (Symbol). Years of application
code have relied on the fact that these objects are canonical, and
you’ll never have more than one with the same name. Comparisons use ==,
not =. Until recently this canonicalization was maintained by storing
the instances in multi-level dictionaries that were faulted into the
image. This approach became problematic as the number of instances
increased and a new requirement came along to allow new keys to be
generated at any time, not at defined points.
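
To make the invariant concrete, the lookup on the image side is
essentially a registry keyed by the name; the class and selector names
below are made up, ours differ:

    CurveKey class >> named: aSymbol
        "Answer the single instance for aSymbol, creating and registering
         it on first use. Registry is a class-variable dictionary keyed
         by Symbol, so application code can safely compare with ==."
        ^Registry at: aSymbol ifAbsent:
            [Registry at: aSymbol put: (self basicNew setName: aSymbol)]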

Some of the basics of our new solution are similar to the date
approach. There’s a canonical structure on each side (dictionary) that
is not replicated. When we fault an object we check for an existing VW
instance and if necessary map to that instead. One new wrinkle on the
faulting side is stubs. Since the application relies so heavily on
identity comparison we have to handle the case where the object was
created locally and registered in the image side dictionary, and later
we attempt to create and map a stub for the real persistent object. If
we allow the stub to be created we effectively have two of our keys
with the same name that are no longer identical. To prevent this we’ve
hacked even more deeply into core replication methods like
#clientObject:namedBuffer:indexableBuffer:slot:lookupOop:forwarder:secondPassLog:cached:keeper:.
If we’re about to create a stub for an instance of one of our key classes
we first use fetch operations to retrieve the name, which is one of the
inst var values. (Note that #privateExecute: at this point can cause
moreTraversal errors). We then check the local dictionary and if an
instance with that name has already been created we resolve the
replication to that instance instead of creating a stub. Otherwise we
allow the stub to be created, but then add the stub to the local
dictionary.
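
In outline, the check we splice in just before stub creation looks like
this; every selector below except the registry access is a made-up
wrapper around GBS fetch and mapping operations, so treat it as a
sketch only:

    GbsSession >> resolveKeyStubForOop: lookupOop
        "Avoid ever having two key objects with the same name in the
         image. #fetchKeyNameForOop: uses fetch operations rather than
         #privateExecute:, which can raise moreTraversal errors here;
         the other helpers wrap stub creation and oop-to-client mapping."
        | name existing stub |
        name := self fetchKeyNameForOop: lookupOop.
        existing := CurveKey registryAt: name ifAbsent: [nil].
        existing notNil ifTrue:
            [self mapOop: lookupOop toClientObject: existing.
             ^existing].
        stub := self buildStubForOop: lookupOop.
        CurveKey registryAt: name put: stub.
        ^stub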

On the flushing side we use #privateExecute: to see if the object
exists already in the persistent dictionary. If it does we return the
encoded oop without actually reading the object’s data page using
#_instVarAsEncodedOop: (thanks Norm). From there we can just create a
GbsObject and map it just as we do for dates. If the object doesn’t exist
things get trickier. We have to ensure that it gets added to the
persistent GS dictionary. In order to get this right in the case of
things like concurrency conflicts and various failure scenarios,
knowledge of these lazily flushed instances had to be embedded into our
transaction framework.
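
The flushing override then looks roughly like this; the GS-side lookup
expression and the other helper selectors are illustrative:

    CurveKey >> asGSObjectInSession: aSession
        "If a persistent instance with this name already exists, answer a
         GbsObject built from its encoded oop (fetched on the GS side via
         #_instVarAsEncodedOop: so the data page is never read). Otherwise
         fall back to normal replication and let our transaction framework
         add the new object to the persistent dictionary at commit time."
        | encodedOop |
        encodedOop := aSession privateExecute: self persistentLookupExpression.
        encodedOop isNil ifTrue:
            [self registerForLazyAddIn: aSession.
             ^super asGSObjectInSession: aSession].
        ^aSession gbsObjectForEncodedOop: encodedOop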

The end result is that these objects can just be created on the fly in
any image (or gem for that matter) and we always guarantee
canonicalization in both spaces. We’re happy with the result but
curious if anyone has addressed this in a way that involved diving less
deeply into GBS…

Thanks - Keith




_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
https://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk

Re: Fw: Canonicalization + 2 Spaces

Gemstone/S mailing list
Hi Keith,

Nice to hear from you. 

I had also implemented Date instance canonicalization much as you described. It was a big improvement for the project at the time because our application model had previously replicated many dates into GBS caches. 

Didn't GemTalk make Date instances immediate objects, maybe around GS 3.0? For DateTime replication tuning, storing integers was better than canonical instances but required consistent use of conversion accessors. Application code in GS could potentially consume many oops for the transient DateTime instances that might get created, but in practice that was not a problem where things were tuned this way. Having DateTime instances as immediates would reduce the risk that the accessor trick isn't the most efficient choice for some application usage scenarios in the gem.
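
The accessor trick was just a pair like this (class names and conversion
selectors here are placeholders for whatever the dialect provides):

    Trade >> settlementTime
        "settlementSeconds holds a SmallInteger (seconds since an agreed
         epoch), so replication moves an immediate value instead of a
         DateTime; the DateTime is rebuilt transiently on access."
        ^TimeValue fromEpochSeconds: settlementSeconds

    Trade >> settlementTime: aTimeValue
        "Store only the integer; the DateTime itself is never persisted
         or replicated."
        settlementSeconds := aTimeValue asEpochSeconds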

Oh, speaking of replication tuning, one of the big improvements I achieved was through self-generating custom replication specs. In tuning mode the replication was one level deep (except for some collections), and stubs that got faulted for a declared context/replication would get recorded into contextual replication specs. It was an iterative tuning process, but the result was that replication would include no more than was needed and had only deliberate stub faulting later. Avoiding growth of GBS caches had become mission critical, and this achieved it until application code could be reimplemented to run in gems alone. The loss of efficient copy replication had made this necessary, but it was also used to tune VW+GBS applications to consistently offer near-instant response times.
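
Conceptually the recording side of that tuning mode was no more than
something like this (all names invented for illustration; the real hook
sat in the fault path):

    ReplicationTuner class >> noteFaultOf: instVarPath for: aContextSymbol
        "During a tuning run, remember which stubs actually got faulted
         under a declared context so that context's next replication spec
         includes exactly those paths and nothing more. ContextSpecs is a
         class-variable Dictionary of context symbol -> Set of paths."
        | paths |
        paths := ContextSpecs at: aContextSymbol ifAbsent:
            [ContextSpecs at: aContextSymbol put: Set new].
        paths add: instVarPath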

Looking back, one of the biggest problems with Smalltalk in general was that people could too easily write inefficient code. For me it became a full-time job cleaning and tuning code that got produced at a fast rate for releases. At least the rigors of C coding brought attention to efficiency from the start. I'm happy to be retired now.

Paul Baumann




