Stream from sorting an indexed collection?


Stream from sorting an indexed collection?

Thelliez
Hello,

The last paragraph of Chapter 5 of the prog guide, p125, introduces
'sortAscending:'.

Reading the code Collection >> sortAscending: comment is:

"Returns an Array containing the elements of the receiver, sorted in ascending
 order, as determined by the values of the instance variables represented by
 aSortSpec."

But assuming a large indexed collection (something like 2 million
objects), how do you implement a memory efficient looping through all
the sorted objects? In other words, wouldn't a resulting Stream be
more efficient than an Array? Something like sortAscendingAsStream: ?


Disclaimers:  I have not done any measurements yet.  Just fishing for ideas.



Thanks,
Thierry

Re: Stream from sorting an indexed collection?

Dale Henrichs
Thierry,

For indexed collections you can use #selectAsStream: to efficiently stream over the collection in sort order (in ascending order). Here's a simple example:

 | set stream result |
 set := #(1 2 3 4 5 6 7 8 9 10) asSet.
 set createEqualityIndexOn: '' withLastElementClass: SmallInteger.
 stream := set selectAsStream: {:each | each > 0 }.
 result := OrderedCollection new.
 [stream atEnd] whileFalse: [ result add: stream next ].
 result

The result will be sorted in ascending order. Of course with the above expression you will traverse the whole stream unless you kick out early.
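
For example, a minimal sketch of kicking out early, reusing the same indexed set and stopping after the first three elements in sort order:

 | set stream firstThree |
 set := #(1 2 3 4 5 6 7 8 9 10) asSet.
 set createEqualityIndexOn: '' withLastElementClass: SmallInteger.
 stream := set selectAsStream: {:each | each > 0 }.
 firstThree := OrderedCollection new.
 [stream atEnd or: [ firstThree size >= 3 ]]
   whileFalse: [ firstThree add: stream next ].
 firstThree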

Using select blocks you can efficiently stream over a range of the collection:

 | set stream result |
 set := #(1 2 3 4 5 6 7 8 9 10) asSet.
 set createEqualityIndexOn: '' withLastElementClass: SmallInteger.
 stream := set selectAsStream: {:each | (each > 4) & (each <= 7)  }.
 result := OrderedCollection new.
 [stream atEnd] whileFalse: [ result add: stream next ].
 result

So I think the basic capabilities are already there ... the answer is to not use sortAscending:...

Dale


Re: Stream from sorting an indexed collection?

Thelliez
Dale,

Thank you for your answer.  I have a few more questions about indexing.


1- Adding to an indexed collection:  On top of page 118 of the programmer guide, it is said that creating indexes takes a lot of resources.  I am assuming that it is ok to create the index on a small collection (maybe when it reaches 2000 objects) and then just add objects to this indexed Set, correct? Adding objects to an already indexed collection should not take too much time?

2- Removing objects: Are there any special considerations when removing objects from an indexed collection?

3- Modifying an object: The bottom part of p118 is not clear. I am assuming that it is safe to modify an object in an indexed collection as long as the instance variables used for the indexing are not modified, correct?

4- Modifying an object in one index out of two: With your suggestion of using an index for sorting (see previous email), it looks like I should create one index for the 'key' (String) and one index for the result sorting (DateTime). The 'key' is not going to change, but the date instance variable could. On page 110 I see that one should call removeObjectFromBtrees before updating an object. But how does GemStone know that I want to keep one index untouched? The text does not address the case of multiple indexes.

5- Composite key: Our objects are uniquely identified by the combination of several 'fields' (up to 10). To build an index, I am planning on concatenating these fields into one long String to get a unique ID, probably 200 to 300 characters long (less than 900). I read that immediateInvariant should be used to speed up the search. Any issues with these long strings? Or should I consider having a separate index on the different key components (small strings) and search with a conjunction operator?


Cheers,
Thierry






Re: Stream from sorting an indexed collection?

Dale Henrichs
Thierry,

1- Adding to an indexed collection :

  There is not that much difference between creating an index on a collection of 2,000,000 elements and creating the index on a small collection and then adding 2,000,000 elements; the code paths are pretty much the same. If you create the index on a large collection, you need to make sure that you turn on autoCommit for the duration of the index creation step:

  IndexManager autoCommit: true

The expense is coming from the large number of objects that get created and moved around during the initial index creation step.
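
A minimal sketch of that step (the collection name and the 'created' index path are hypothetical):

 "turn on autoCommit for the duration of the index creation on a large collection"
 IndexManager autoCommit: true.
 myLargeSet createEqualityIndexOn: 'created' withLastElementClass: DateTime.
 IndexManager autoCommit: false.   "turn it back off afterward if that matches your transaction policy"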

2- Removing objects :

  No special considerations for removing elements.

  It is worth noting that when you create an index there is a strong reference to that collection from `IndexManager current`. If you intend to drop a previously indexed collection on the floor, you need to remove the index before the collection can be garbage collected (see the sketch below).
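
A sketch of that cleanup, assuming the removeAllIndexes selector from the indexing API:

 "drop the indexes so that IndexManager current no longer holds onto the collection"
 myIndexedSet removeAllIndexes.
 myIndexedSet := nil.   "the collection can now be garbage collected"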

3- Modifying an object:

  It is perfectly safe to modify objects that are part of indexed collections ... One of the features of GemStone indexes is automatic index maintenance, which means that once you have created an indexed collection, GemStone will manage the index for you from there on out.

4- Modifying an object in one index out of two :

  The section surrounding page 110 describes what you should do if you have redefined the equality/comparison operators for a class. If you change the programmatic definition of #= or #<=, then you need to remove and re-add all of the elements of that class so they will be properly sorted.

  In normal use it is unnecessary to worry about whether or not an object is in an index when modifying the values, no matter how many indexes the object participates in.

5- Composite key :

  There are performance related issues to consider when using CharacterCollections in an index:

    - RangeEqualityIndex class>>_maxCharsForStringComparison (900) specifies
      the maximum number of characters used for comparison; characters beyond
      the max are ignored for index comparisons

    - The first 10 bytes of a String are cached directly in the index data
      structure to avoid having to fault the String into memory for comparisons.
      The implication is that for maximum performance you should try to keep
      string keys to 10 bytes or less.

    - SmallIntegers, SmallDoubles, Time, and DateTime instances can be fully
      encoded in the BTree structure, so converting your unique ID to a SmallDouble
      or SmallInteger would give you superior performance (see the sketch after
      this list).
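
For example, a sketch of an equality index on a SmallInteger instance variable (the collection and the 'uniqueId' variable are hypothetical names):

 | parts |
 parts := someDomainObjects.   "an unordered collection whose elements have a uniqueId instVar"
 parts createEqualityIndexOn: 'uniqueId' withLastElementClass: SmallInteger.
 parts select: {:each | each.uniqueId = 1234567 }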

The encoded BTrees can contain up to 500 keys per page. On average you will make 250 comparisons to find a single element.

If you are using more than 10 bytes in the String, you can end up with 250 page reads to find a single element.

If you are using 10 bytes or less, it will take 0 page reads to find the element on the page.

If you are working with 10 separate indexes and can leverage the encoded keys, it will cost you roughly 10 page reads to find a single element. You will incur the cost of traversing 10 trees instead of one, but with a 25X advantage you have some wiggle room...
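
For example, a sketch of a conjunction query across two separately indexed fields (the collection and instance variable names are hypothetical):

 parts createEqualityIndexOn: 'family' withLastElementClass: String.
 parts createEqualityIndexOn: 'yearCode' withLastElementClass: SmallInteger.
 parts select: {:each | (each.family = 'A12') & (each.yearCode = 2011) }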

In the end though, you'll need to do some testing on your own with large collections, cold caches and representative queries to determine the best scheme for your situation.

Dale


Re: Stream from sorting an indexed collection?

Thelliez
Dale,

Thanks a lot for your detailed explanations!

When you say that the first 10 bytes are cached in the index, does that also mean that we should find a way to have the most significant digits first? Our IDs look like very long serial numbers; often the difference is within the last digits.


The surprise to me is that the system performed OK without indexing!  We had a demo last week for which I thought that indexing would be a must-have.  It took 25 minutes to process 1.3M objects.  Since this is for a batch run to generate reports, the performance is acceptable for now.  This was using the curly-bracket select blocks and making the key string 'immediateInvariant'.  Also, I redesigned our system to have a tree of collections rather than one large flat collection.  That seems to have helped.



Thanks,
Thierry

Re: Stream from sorting an indexed collection?

Dale Henrichs
Thierry,

Yes, the first 10 bytes are compared first, so you want the significant digits to come first.
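
A tiny sketch of that idea (the accessors are hypothetical): build the composite key with the part that actually varies at the front.

 "serialSuffix differs between objects far more often than productLine does"
 key := anObject serialSuffix , '-' , anObject productLine.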

Glad to hear that you aren't hitting performance bottlenecks right out of the gate!

Dale


Re: Stream from sorting an indexed collection?

Thelliez
Hello,

On the page that follows, the 2000-element threshold is described:
'Indexes generally should not be created unless the collection is larger than about 2000 elements in size.'

I created a subclass of Set with the following add: method:

 add: anObject
   self size > 2000 ifTrue: [
     self equalityIndexedPaths isEmpty ifTrue: [
       IndexManager autoCommit: true.
       self createEqualityIndexOn: 'key' withLastElementClass: String.
       self createEqualityIndexOn: 'created' withLastElementClass: DateTime ] ].
   ^ super add: anObject

This seems to work fine: 5x improvement.

But I have two questions.
1- As previously described, the keys are the aggregation of up to 7 strings. My aggregated keys are 40 to 100 characters long. Would using String's hash method make sense? I have never used it before.

2- The above code leaves me with a mix of indexed and non-indexed collections. Transferring the non-indexed ones into a SortedCollection for reporting based on the 'created' value is a little bit slow. The size of my collections is unbalanced: a few big ones, but also several with only a few hundred elements. Would it make more sense to create a model where data is stored in a SortedCollection up to 2000 elements, and then dump the elements into an indexed Set after that?


Thanks,
Thierry


Re: Stream from sorting an indexed collection?

James Foster-8
Hi Thierry,

Comments below...

On Dec 23, 2011, at 3:54 PM, Thierry Thelliez wrote:

> Hello,
>
> On the following page the following 2000 threshold is described:
> 'Indexes generally should not be created unless the collection is larger than about 2000 elements in size.'
>
> I created a subclass of Set with the following add: method
>  ( self size > 2000 )
> ifTrue: [ self equalityIndexedPaths isEmpty
> ifTrue: [ IndexManager autoCommit: true.
> self createEqualityIndexOn: 'key' withLastElementClass: String.  
> self createEqualityIndexOn: 'created' withLastElementClass: DateTime.] ].
> ^ super add: anObject.
>
> This seems to work fine: 5x improvement.

Lazy creation of indexes seems to me to be a bit surprising. Having to evaluate #'equalityIndexedPaths' to do an #'add:' might be expensive (though I haven't measured it); changing the value of autoCommit as a side effect of #'add:' is unexpected; and committing a transaction as a side effect of doing an #'add:' might be wrong.

>
> But I have two questions.
> 1- As previoulsy described, the keys are the aggregation of up to 7 strings.  My aggregated keys are 40 to 100 character long. Would using the hash String method makes sense? Never used it before.

Typically #'hash' is used to distribute objects among different buckets (usually for a Dictionary or Set); I don't think that it has anything to do with equality indexing.

> 2- The above code leaves me with a mix of indexed and non-indexed collections.  Transferring the non-indexed ones in s Sorted collection for reporting based on the 'created' value is a little bit slow. The size of my collections is unbalanced: few big ones, but also several with only a few hundred elements. Would it make more sense to create a model where data is stored in a SortedCollection up to 2000 elements, and then dump the elements in an Indexed Set after that?

It seems to me that most applications will have relatively predictable collection sizes (at least to an order of magnitude). Indexing will require some knowledge of the domain (including the class and instance variable names), so it can't really be done automatically. It has been my experience that refactoring to create (and use) indexes can wait until there is some measurable slowness in the application.

Use of SortedCollections raises similar issues since you are trading update (add/remove) time for reporting time. If a collection is rarely modified and frequently reported (e.g., a list of countries/states/provinces), then sorting during update is better; if a collection is frequently modified and rarely reported (e.g., users currently logged in), then sorting during reporting is better. Without knowing the application, we can't say whether use of a SortedCollection is desirable.

-James


Re: Stream from sorting an indexed collection?

Thelliez
James,

Sorry I did not get back in touch sooner... very busy with a deliverable. Everything works fine for now. Here are just a few notes about our domain.

1- Lazy indexing
The lazy indexing scheme seems to be working fine so far. I ran a quick test to look at our collection sizes. Below is a dictionary with the key being the number of digits in the collection's size and the value being the number of collections:

aDictionary( 1->3390, 3->391, 4->93, 5->9, 6->1, 2->1202, 'total'->879355)

In other words, we have 3390 collections containing 1 to 9 objects, 1202 collections containing 10 to 99 objects, and so on, plus one collection containing 400K objects. Starting the project, we really did not know what growth we would be seeing, and some external events might still trigger rapid growth of some of the smaller collections.

For now we have a total of 1.5M objects, but only 879K are kept in this representation because of various business rules. The collections, exported as files, can be seen at http://www.regulations.doe.gov/certification-data.

The overhead of checking the size and indexes does not seem to be a big issue. Maybe it could be optimized using a 'become' once the collection reaches 2000 objects?

The auto-commit under the add: could indeed be a problem, but not in this case.

Instead of this collection of collections, we could have flattened everything into one big collection and relied on multiple indexes. We will see if performance degrades.

2- Hash
I think that we are talking about the same thing.
Comparing the long composite keys is OK right now (a 1h30 nightly batch run). But I was thinking of eventually using the hash method to accelerate the search (sketched after this list):
a- store a hash result of the long composite key as the new primary key.
b- 'collect' all the objects with the same (hash) key. That should be very fast from what I understood in Dale's email.
c- iterate through this (hopefully small) collection to find an exact match with the full composite key.
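
A rough sketch of that scheme (the records collection and the keyHash/fullKey accessors are hypothetical names; an equality index on 'keyHash' is assumed):

 | keyHash candidates match |
 keyHash := aCompositeKey hash.
 candidates := records select: {:each | each.keyHash = keyHash }.   "fast, uses the index"
 match := candidates detect: [:each | each fullKey = aCompositeKey ] ifNone: [ nil ].
 match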


Thierry



Re: Stream from sorting an indexed collection?

James Foster-8
Thierry,

As to hashing, if you want to do lookups on a key and don't need the keys to be sorted, then a Dictionary may be a good option.
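
A minimal sketch of that alternative (the compositeKey accessor is a hypothetical name):

 | byKey |
 byKey := Dictionary new.
 byKey at: aRecord compositeKey put: aRecord.
 byKey at: aCompositeKey ifAbsent: [ nil ]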

-James


Re: Stream from sorting an indexed collection?

Thelliez
Hello,

Just a follow-up on using String hashes.

Following Dale's explanation about the impact of long strings used as keys, I replaced our long composite keys with their hash (plus a mechanism to detect collisions, if any). The speed improvement is between 10 and 15%. That is for 1.5M objects.


Thierry