Spreadsheet representation

classic Classic list List threaded Threaded
4 messages Options
Thierry Thelliez-2 Thierry Thelliez-2
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Spreadsheet representation

Hello all,

 

We are about to start representing in GemStone some content coming from Spreadsheets, some will  be big (200K rows) but most will be small (~10 rows). They will all have about 30 columns. And probably few hundreds of these spreadsheets.

 

(Actually the biggest ones are probably going to come directly in some sort of data dump yet to be determined, but that’s another story. )

 

Do you have any suggestions for internal representation in GemStone?  We will have to access the data in different ways:

-          A given column for a given row.

-          A subset of the rows matching some criteria (values on some columns)

 

Is there any smarter approach than an Array of Arrays?

 

 

Thanks,

Thierry Thelliez

Stephan Eggermont Stephan Eggermont
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Spreadsheet representation


On 26 sep 2010, at 05:54, Thierry Thelliez wrote:

> Hello all,
>  
> We are about to start representing in GemStone some content coming from Spreadsheets, some will  be big (200K rows) but most will be small (~10 rows). They will all have about 30 columns. And probably few hundreds of these spreadsheets.

30K rows * 30 columns * 300 spreadsheets ~= 300 million cells.

The naive approach might not be optimal. Normally, spreadsheets are sparse, and cell values, styles and formulas repeat a lot.
If they are very sparse, using a column and row array and cells linked up-down and left-right works well. If they are less sparse,
introducing ranges of cells, possibly with a bitmap to indicate which ones are filled, or an R-tree or a form of a Judy Trie

You could avoid storing sum rows/columns.

Stephan



Thierry Thelliez Thierry Thelliez
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Spreadsheet representation

In reply to this post by Thierry Thelliez-2
Thanks Stephan,

In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns. 

I will have to read more about the Judy tree idea.

Thanks,
Thierry




Stephan Eggermont Stephan Eggermont
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Spreadsheet representation

In reply to this post by Thierry Thelliez-2
Hello Thiery,

On 27 sep 2010, at 22:27, Thierry Thelliez wrote:
> In our case all the cells are going to be filled and indeed this is going to present a fair amount of cells. There is no need to store formulas, just strings and numbers. This application is like an inventory system: a lot of models (rows) with variations on attributes (columns). Some (many?) attributes will have the same value. This will help for a couple of columns.  

Then it is just like a badly normalized table: extract the columns with
attributes that belong together in a separate table,
count the number of variations and use the number of bits
needed to index.  Split columns with very similar but long values
(e.g. every part number starts with '963 777 355' because that is the
code for the business unit, region and plant where these
parts are made).  

Make sure you hide all this in your implementation, and don't expose
it in the interface.

Stephan


Loading...