Spreadsheet representation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Spreadsheet representation

Thierry Thelliez-2

Hello all,

 

We are about to start representing in GemStone some content coming from Spreadsheets, some will  be big (200K rows) but most will be small (~10 rows). They will all have about 30 columns. And probably few hundreds of these spreadsheets.

 

(Actually the biggest ones are probably going to come directly in some sort of data dump yet to be determined, but that’s another story. )

 

Do you have any suggestions for internal representation in GemStone?  We will have to access the data in different ways:

-          A given column for a given row.

-          A subset of the rows matching some criteria (values on some columns)

 

Is there any smarter approach than an Array of Arrays?

 

 

Thanks,

Thierry Thelliez

Reply | Threaded
Open this post in threaded view
|

Re: Spreadsheet representation

Stephan Eggermont-3

On 26 sep 2010, at 05:54, Thierry Thelliez wrote:

> Hello all,
>  
> We are about to start representing in GemStone some content coming from Spreadsheets, some will  be big (200K rows) but most will be small (~10 rows). They will all have about 30 columns. And probably few hundreds of these spreadsheets.

30K rows * 30 columns * 300 spreadsheets ~= 300 million cells.

The naive approach might not be optimal. Normally, spreadsheets are sparse, and cell values, styles and formulas repeat a lot.
If they are very sparse, using a column and row array and cells linked up-down and left-right works well. If they are less sparse,
introducing ranges of cells, possibly with a bitmap to indicate which ones are filled, or an R-tree or a form of a Judy Trie

You could avoid storing sum rows/columns.

Stephan



Reply | Threaded
Open this post in threaded view
|

Re: Spreadsheet representation

Thierry Thelliez
In reply to this post by Thierry Thelliez-2
Thanks Stephan,

In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns. 

I will have to read more about the Judy tree idea.

Thanks,
Thierry




Reply | Threaded
Open this post in threaded view
|

Re: Spreadsheet representation

Stephan Eggermont-3
In reply to this post by Thierry Thelliez-2
Hello Thiery,

On 27 sep 2010, at 22:27, Thierry Thelliez wrote:
> In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns.  

Then it is just like a badly normalized table: extract the columns with
attributes that belong together in a separate table,
count the number of variations and use the number of bits
needed to index.  Split columns with very similar but long values
(e.g. every part number starts with '963 777 355' because that is the
code for the business unit, region and plant where these
parts are made).  

Make sure you hide all this in your implementation, and don't expose
it in the interface.

Stephan