|
Thierry Thelliez-2 |
|
|
Hello all, We are about to start representing in GemStone some content
coming from Spreadsheets, some will be big (200K rows) but most will be small
(~10 rows). They will all have about 30 columns. And probably few hundreds of
these spreadsheets. (Actually the biggest ones are probably going to come
directly in some sort of data dump yet to be determined, but that’s another
story. ) Do you have any suggestions for internal representation in
GemStone? We will have to access the data in different ways: -
A given column for a given row. -
A subset of the rows matching some criteria (values on
some columns) Is there any smarter approach than an Array of Arrays? Thanks, Thierry Thelliez |
|
Stephan Eggermont |
|
|
On 26 sep 2010, at 05:54, Thierry Thelliez wrote: > Hello all, > > We are about to start representing in GemStone some content coming from Spreadsheets, some will be big (200K rows) but most will be small (~10 rows). They will all have about 30 columns. And probably few hundreds of these spreadsheets. 30K rows * 30 columns * 300 spreadsheets ~= 300 million cells. The naive approach might not be optimal. Normally, spreadsheets are sparse, and cell values, styles and formulas repeat a lot. If they are very sparse, using a column and row array and cells linked up-down and left-right works well. If they are less sparse, introducing ranges of cells, possibly with a bitmap to indicate which ones are filled, or an R-tree or a form of a Judy Trie You could avoid storing sum rows/columns. Stephan |
|
Thierry Thelliez |
|
|
In reply to this post by Thierry Thelliez-2
Thanks Stephan,
In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns. I will have to read more about the Judy tree idea. Thanks, Thierry |
|
Stephan Eggermont |
|
|
In reply to this post by Thierry Thelliez-2
Hello Thiery,
On 27 sep 2010, at 22:27, Thierry Thelliez wrote: > In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns. Then it is just like a badly normalized table: extract the columns with attributes that belong together in a separate table, count the number of variations and use the number of bits needed to index. Split columns with very similar but long values (e.g. every part number starts with '963 777 355' because that is the code for the business unit, region and plant where these parts are made). Make sure you hide all this in your implementation, and don't expose it in the interface. Stephan |
| Powered by Nabble | See how NAML generates this page |