Hi guys,
I am storing huge lists of "prices". For this, it really helps me to store things ordered (by date)...either in an SequenceableCollection or a SortedCollection. On the other hand, I do want to use indexes to speedup my query to find price for a given date (equality index). But I have found no way to have them both. The only workaround I found is to keep 2 collections for each of these collections, one sorted/ordered, and the other one an unordered one for querying via index. But this is a pain from "developing" point of view, as well as for unnecessary repository growth. Am I missing something? _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Since you are keeping the prices sorted, you can lookup a price in
the sorted list using a binary search. This gives you a worst-case
search of log2(n) +1, which should be pretty fast as long as the
price comparison methods are fast. You should be able to create a
subclass of SortedCollection to do this.
Also keeping 2 collections that each reference the same large set of objects should not be that expensive in terms of repository size. That said, I agree that keeping the prices in 1 collection is a cleaner design. Norm On 8/7/17 11:23, Mariano Martinez Peck
via Glass wrote:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
On Mon, Aug 7, 2017 at 4:27 PM, Norm Green via Glass <[hidden email]> wrote:
Hi Norm, Thanks for the idea. I was using binary search already in other places, so yeah, I can also do that here. However, I never found out a GemStone method for this and instead I ported the implementation from Pharo: SequenceableCollection >> findBinaryIndex: aBlock do: actionBlock ifNone: exceptionBlock "Search for an element in the receiver using binary search. The argument aBlock is a one-element block returning 0 - if the element is the one searched for <0 - if the search should continue in the first half >0 - if the search should continue in the second half If found, evaluate actionBlock with the index as argument If no matching element is found, evaluate exceptionBlock, with the indexes of the 'bounding' elements as arguments. Warning: Might give invalid indexes, see examples below Examples: #(1 3 5 7 11 15 23) findBinaryIndex: [ :arg | 11 - arg ] do: [ :found | found ] ifNone: [ :a :b | 'between: ', {a. b} printString ] #(1 3 5 7 11 15 23) findBinaryIndex: [ :arg | 12 - arg ] do: [ :found | found ] ifNone: [ :a :b | 'between: ', {a. b} printString ] #(1 3 5 7 11 15 23) findBinaryIndex: [ :arg | 0.5 - arg ] do: [ :found | found ] ifNone: [ :a :b | 'between: ', {a. b} printString ] #(1 3 5 7 11 15 23) findBinaryIndex: [ :arg | 25 - arg ] do: [ :found | found ] ifNone: [ :a :b | 'between: ',{a. b} printString ] " | index low high test | low := 1. high := self size. [ index := (high + low) // 2. low > high ] whileFalse: [ test := aBlock value: (self at: index). test = 0 ifTrue: [ ^ actionBlock value: index ] ifFalse: [ test > 0 ifTrue: [ low := index + 1 ] ifFalse: [ high := index - 1 ] ] ]. ^ exceptionBlock cull: high cull: low Do you think that's safe enough or I am missing a GemStone implementation? The only thing I found in GemStone is: SequenceableCollection >> binarySearchIncludes: anObject "Returns true if the argument anObject is equal to an element of the receiver. Returns false otherwise. Note that this search is optimized based on the assumption that the sort criteria is consistent with the equality comparison. That is, if (x <= y) & (y <= x) then (x = y). This assumption will not hold if sorting is done on one instance variable and equality comparison is done on another instance variable (for example). SortedCollection>>#'includes:' is no longer implemented and the linear search is inherited from Collection. See bug 40575." ^ (self indexOf: anObject) ~~ 0
Well.. yes.. but these can have 365 days * 20 years * 30k companies. And this is only daily prices..not even considering minute/hourly/etc. So I am not sure.
Yes...it complicates logic a lot having 2... Cheers,
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Administrator
|
In reply to this post by GLASS mailing list
Mariano, have you read chapter 7 in the GemStone/S 64 Programming Guide? It's all about indexing. It looks like you could define a "range" query over the unordered collection to get the sorted sequence needed to traverse all the prices in date order. The index would be on the date and the range query would be from some "least date" through some "greatest date". You would then use the streaming results or the #do: message I think) to iterate over the query result in date order. There is a section in chapter 7.2 discussing result order. It might be helpful to discuss the use cases for the two collections. When would you iterate over all the dates versus when would you search for specific dates or ranges of dates?
|
On Mon, Aug 7, 2017 at 5:04 PM, Richard Sargent via Glass <[hidden email]> wrote: GLASS mailing list wrote Hi Richard, Thanks for such a pointer. And no, I wasn't aware of that kind of range query. I will read it as soon as I am back or tomorrow first thing in the morning. In the meanwhile, I will quickly answer your below question. It might be helpful to discuss the use cases for the two collections. When I need to support 2 kinds of getting a price from a security: 1) Get the price for given date X and if not found, answer nil. 2) Get the price for a given date X but it it wasn't found, find the closest (but older) one. Kind of "please gave me the latest price you can for this date" 1) can simply call 2) and check if the date answer is the same as the original. However, I want 1) to be as fastest as possible, meaning I may not want to use 2) directly. So far, I was trying this kind of code which obviously doesn't work because I cannot have collection both sorted and with index, but this was the idea anyway: prices := self storageDict at: securityUniqueId ifAbsent: [ nil ]. (prices notNil and: [ prices notEmpty ]) ifTrue: [ onOrBeforeMatching ifTrue: [ | exactPrice closest | exactPrice := prices detect: (DpDatabaseQueryClosureBuilder current selectClosurePriceRecordByDate: dateToCompare) ifNone: [ nil ]. exactPrice ifNil: [ closest := prices first. "we know prices are sorted.." prices do: [ :each | (each date > dateToCompare) ifTrue: [ ^ (closest date <= dateToCompare) ifTrue: [ closest ] ifFalse: [ nil ] ]. closest := each. ]. ^ (closest date <= dateToCompare) ifTrue: [ closest ] ifFalse: [ nil ]. ] ifNotNil: [ ^ exactPrice ] ] ifFalse: [ ^ prices detect: (DpDatabaseQueryClosureBuilder current selectClosurePriceRecordByDate: dateToCompare) ifNone: [ nil ]. ] ]. ^ nil DpDatabaseQueryClosureBuilder answers those magic search block for the indexes.... The boolean onOrBeforeMatching is the one that defines either behavior 1) or 2). My idea was simple. For 1) use the query for the date index. For 2) use also same query and only if it answers nothing, then iterate collection. Sure, I can replace that iteration with a binary search too. But again, all this doesn't work ... Hope this helps clarifying my questions... Thanks a lot in advance,
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Administrator
|
For this, you use the #= predicate and #detect:ifNone:. For this, you use a #<= predicate for the desired date, coupled with #reversedReadStream. Take the first result from the stream (after ensuring it isn't empty, of course).
|
In reply to this post by GLASS mailing list
I think your use case is what
http://smalltalkhub.com/#!/~emaringolo/History is designed for. I haven't implemented "binary search for nearest earlier than X date/time" in it yet though
|
In reply to this post by Richard Sargent
On 8/7/17 1:04 PM, Richard Sargent via Glass wrote: > GLASS mailing list wrote >> Hi guys, >> >> I am storing huge lists of "prices". For this, it really helps me to store >> things ordered (by date)...either in an SequenceableCollection or a >> SortedCollection. On the other hand, I do want to use indexes to speedup >> my >> query to find price for a given date (equality index). >> >> But I have found no way to have them both. The only workaround I found is >> to keep 2 collections for each of these collections, one sorted/ordered, >> and the other one an unordered one for querying via index. But this is a >> pain from "developing" point of view, as well as for unnecessary >> repository >> growth. >> >> Am I missing something? > Mariano, have you read chapter 7 in the GemStone/S 64 Programming Guide? > It's all about indexing. > > It looks like you could define a "range" query over the unordered collection > to get the sorted sequence needed to traverse all the prices in date order. > The index would be on the date and the range query would be from some "least > date" through some "greatest date". You would then use the streaming results > or the #do: message I think) to iterate over the query result in date order. > There is a section in chapter 7.2 discussing result order. > > > It might be helpful to discuss the use cases for the two collections. When > would you iterate over all the dates versus when would you search for > specific dates or ranges of dates? > > the exact date: detectQuery := (GsQuery fromString: 'each.value = targetDate' on: nsc) and one query to find the first date prior to your targetDate: nearestQuery := (GsQuery fromString: 'each.value < targetDate' on: nsc) Then you can use detect:ifNone: sent to the query object itself to either find the first matching date using the index or the first date prior to the target data using an index ... in both queries you are using very efficient btree lookups and avoiding the need to scan your collection (key is the price and value if the date): | nsc random maxYear detectQuery targetDate result | nsc := IdentityBag new. random := HostRandom new. GsIndexSpec new equalityIndex: 'value' lastElementClass: Date; createIndexesOn: nsc. 1 to: 100 do: [ :index | nsc add: (ScaledDecimal for: random float scale: 2) -> (Date newDay: (random integerBetween: 1 and: 365) year: (random integerBetween: 2000 and: 2017)) ]. targetDate := Date newDay: 250 year: 2011. detectQuery := (GsQuery fromString: 'each.value = targetDate' on: nsc) bind: 'targetDate' to: targetDate. result := detectQuery detect: [ :date | true ] ifNone: [ | nearestQuery | nearestQuery := (GsQuery fromString: 'each.value < targetDate' on: nsc) bind: 'targetDate' to: targetDate. nearestQuery reversedReadStream next]. {nsc. nsc sortAscending: 'value'. targetDate. result} I'm returning the sorted nsc, to make it easy to validate that the result is correct, since we're generating random dates. The parsed query can be persisted (with or without the nsc attached) to avoid the overhead of parsing the query string each time you run the query ... _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Here's an updated script to show how to use a query with two different
dates ... targetDate1 is a random date presumed to miss any existing data ... targetDate2 is a date from a known sample ...... | nsc random maxYear detectQuery targetDate1 targetDate2 result1 result2 nearestQuery theDate | nsc := IdentityBag new. random := HostRandom new. GsIndexSpec new equalityIndex: 'value' lastElementClass: Date; createIndexesOn: nsc. detectQuery := GsQuery fromString: 'each.value = targetDate' on: nsc. nearestQuery := GsQuery fromString: 'each.value < targetDate' on: nsc. 1 to: 100 do: [ :index | nsc add: (ScaledDecimal for: random float scale: 2) -> (Date newDay: (random integerBetween: 1 and: 365) year: (random integerBetween: 2000 and: 2017)) ]. targetDate1 := Date newDay: 250 year: 2011. detectQuery bind: 'targetDate' to: targetDate1. nearestQuery bind: 'targetDate' to: targetDate1. result1 := detectQuery detect: [ :date | true ] ifNone: [ nearestQuery reversedReadStream next]. targetDate2 := (nsc _at: 50) value. detectQuery bind: 'targetDate' to: targetDate2. nearestQuery bind: 'targetDate' to: targetDate2. result2 := detectQuery detect: [ :date | true ] ifNone: [ nearestQuery reversedReadStream next]. {nsc. nsc sortAscending: 'value'. targetDate1. result1. targetDate2. result2} On 8/7/17 6:47 PM, Dale Henrichs wrote: > > > On 8/7/17 1:04 PM, Richard Sargent via Glass wrote: >> GLASS mailing list wrote >>> Hi guys, >>> >>> I am storing huge lists of "prices". For this, it really helps me to >>> store >>> things ordered (by date)...either in an SequenceableCollection or a >>> SortedCollection. On the other hand, I do want to use indexes to >>> speedup >>> my >>> query to find price for a given date (equality index). >>> >>> But I have found no way to have them both. The only workaround I >>> found is >>> to keep 2 collections for each of these collections, one >>> sorted/ordered, >>> and the other one an unordered one for querying via index. But this >>> is a >>> pain from "developing" point of view, as well as for unnecessary >>> repository >>> growth. >>> >>> Am I missing something? >> Mariano, have you read chapter 7 in the GemStone/S 64 Programming Guide? >> It's all about indexing. >> >> It looks like you could define a "range" query over the unordered >> collection >> to get the sorted sequence needed to traverse all the prices in date >> order. >> The index would be on the date and the range query would be from some >> "least >> date" through some "greatest date". You would then use the streaming >> results >> or the #do: message I think) to iterate over the query result in date >> order. >> There is a section in chapter 7.2 discussing result order. >> >> >> It might be helpful to discuss the use cases for the two collections. >> When >> would you iterate over all the dates versus when would you search for >> specific dates or ranges of dates? >> >> > Actually I think you can just set up two queries ... one to search for > the exact date: > > detectQuery := (GsQuery fromString: 'each.value = targetDate' on: nsc) > > and one query to find the first date prior to your targetDate: > > nearestQuery := (GsQuery fromString: 'each.value < targetDate' on: nsc) > > > Then you can use detect:ifNone: sent to the query object itself to > either find the first matching date using the index or the first date > prior to the target data using an index ... in both queries you are > using very efficient btree lookups and avoiding the need to scan your > collection (key is the price and value if the date): > > | nsc random maxYear detectQuery targetDate result | > nsc := IdentityBag new. > random := HostRandom new. > GsIndexSpec new > equalityIndex: 'value' lastElementClass: Date; > createIndexesOn: nsc. > 1 to: 100 do: [ :index | > nsc > add: (ScaledDecimal for: random float scale: 2) -> > (Date > newDay: (random integerBetween: 1 and: 365) > year: (random integerBetween: 2000 and: 2017)) ]. > targetDate := Date newDay: 250 year: 2011. > detectQuery := (GsQuery fromString: 'each.value = targetDate' > on: nsc) > bind: 'targetDate' > to: targetDate. > result := detectQuery > detect: [ :date | true ] > ifNone: [ | nearestQuery | > nearestQuery := (GsQuery fromString: 'each.value < > targetDate' on: nsc) > bind: 'targetDate' > to: targetDate. > nearestQuery reversedReadStream next]. > {nsc. nsc sortAscending: 'value'. targetDate. result} > > I'm returning the sorted nsc, to make it easy to validate that the > result is correct, since we're generating random dates. > The parsed query can be persisted (with or without the nsc attached) > to avoid the overhead of parsing the query string each time you run > the query ... > _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by GLASS mailing list
On Mon, Aug 7, 2017 at 10:47 PM, Dale Henrichs via Glass <[hidden email]> wrote:
Hi Dale, Thanks for your answers. I am headed bed right now, but I would do a quick question so to have as much as possible info for tomorrow... Aside from both needed queries above, I would still need this 2 more APIs: 3) get the whole price history sorted. 4) get the newest available price of the collection How can I make those fast too taking advantage of the index? For 3) I can simply do the nearestQuery with a future day (like tomorrow), but maybe there is a cleaner way? And for 4? In summary.... I am comparing whether store things sorted (so that 3 and 4 are fast...3 is a simple #last and 4 is simple same collection..nothing to do ) and make binary search for exact query and nearest query, vs store things in a bag and do everything via index (querying will be fast, but I doubt about 3) and 4)... i mostly doubt about 3).. Thanks a lot! Then you can use detect:ifNone: sent to the query object itself to either find the first matching date using the index or the first date prior to the target data using an index ... in both queries you are using very efficient btree lookups and avoiding the need to scan your collection (key is the price and value if the date): _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
On 8/7/17 7:53 PM, Mariano Martinez
Peck wrote:
You've got several options here .. sortAscending: produces a sorted collection, a GsQuery combined with a do:, detect:, etc., a GsQuery combined with readStream, a separate SortedCollection ... (as i described in response to your "can't we use indexes for sorting too" thread you get the last price from a GsQuery using a reversedReadStream ... You are right if you really need to produce the sorted price history quickly, then your best bet is to keep two collections ... use indexed queries for fast price lookups and the sorted collection for price history, first and last ... Dale
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Administrator
|
In reply to this post by GLASS mailing list
Mariano, given the size of this collection, how would having all 1 million prices (or 10 million?) be usable? #do: would allow you to iterate across the collection indexed by date and get the prices in date order as would streaming over the result. Dale points out that you can get the result set as a collection, but what is it you would do with it such that you need the entire sorted collection at one time?
|
Thank you all for your ideas and discussions. I learned a lot: 1) I wasn't aware of the streaming API (even less than results could be sorted using the index) of GsQuery (until now I was only using the { } kind of syntax). 2) I didn't know that #sortAscending: and friends could also take benefits of the index for sorting. Both things are very cool. For my particular case, I finally decided to keep both collection... a SortedCollection for getting "last price" and "all price history sorted" very fast, and a "Bag with indexes" for querying prices for a given date or closest to it (using both queries as suggested by Dale...with the reversedReadStream and even caching the parsed query etc) . Of course, it made my code a little bit more complex and make the extent grow some more. But it's worth as I need both scenarios to be fast. Thank you very much to all of you. On Tue, Aug 8, 2017 at 12:46 PM, Richard Sargent via Glass <[hidden email]> wrote: GLASS mailing list wrote _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
On 08/09/2017 07:19 AM, Mariano Martinez Peck via Glass wrote:
> For my particular case, I finally decided to keep both collection... a > SortedCollection for getting "last price" and "all price history > sorted" very fast, and a "Bag with indexes" for querying prices for a > given date or closest to it (using both queries as suggested by > Dale...with the reversedReadStream and even caching the parsed query > etc) . Of course, it made my code a little bit more complex and make > the extent grow some more. But it's worth as I need both scenarios to > be fast. > Hi Mariano, To keep the code complexity contained, you can create your own class, each instance of which contains the two collections, and implement whatever nice application-specific query and update methods you need in that class. Regards, -Martin _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Free forum by Nabble | Edit this page |