Best data structure for a multidimensional database?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Best data structure for a multidimensional database?

Stan Shepherd
Hi, I'm looking at building a small proof of concept of a multidimensional modelling tool in Squeak. Commercial products are things like Cognos, and the old Express that was assimilated by Oracle.

A typical 'cube' will be 'dimensioned' by product, region, time. Each dimension has one or more roll-ups, e.g.

                                             All Ice Creams

                                        /                      \
                              Cornetto                             Tub

                        /             |                 \  
               C. Raspberry   C. Vanilla      C. Chocolate


Then sales data would be entered for the lowest level, then rolled up over the product hierarchy, the region hierarchy, the time hierarchy.

From there, you can ask qusetions like "what's the year on year change in sales of Cornetto for Western Europe".

Two data structures spring to mind:

1) Use nested dictionaries for the dimensions, so that from the sales cube we select the dictionary entry for Cornetto, then from there the entry for Western Europe, then the two entries for this year to date and last year to date, being actual numbers.
2) Give each dimension element a numerical index, eg Cornetto is product no 451 in the product dimension. Sales then becomes a single dictionary where we calculate the index as product number + (region number * number of products) + (period number * number of products * number of regions)

no 2) sounds faster, but no 1) sounds Squeakier. Does anyone have any advice as to how best to do this?

Options 3) etc also welcome.

I daresay the correct answer is to do both and see which works best, but I suspect there are some obvious gotchas I'm not seeing.

Thanks,   Stan

Reply | Threaded
Open this post in threaded view
|

Re: Best data structure for a multidimensional database?

Michael van der Gulik-2


On Wed, Oct 8, 2008 at 4:38 AM, stan shepherd <[hidden email]> wrote:

Hi, I'm looking at building a small proof of concept of a multidimensional
modelling tool in Squeak. Commercial products are things like Cognos, and
the old Express that was assimilated by Oracle.

A typical 'cube' will be 'dimensioned' by product, region, time. Each
dimension has one or more roll-ups, e.g.

                                            All Ice Creams

                                       /                      \
                             Cornetto                             Tub

                       /             |                 \
              C. Raspberry   C. Vanilla      C. Chocolate


Then sales data would be entered for the lowest level, then rolled up over
the product hierarchy, the region hierarchy, the time hierarchy.

>From there, you can ask qusetions like "what's the year on year change in
sales of Cornetto for Western Europe".

Two data structures spring to mind:

1) Use nested dictionaries for the dimensions, so that from the sales cube
we select the dictionary entry for Cornetto, then from there the entry for
Western Europe, then the two entries for this year to date and last year to
date, being actual numbers.
2) Give each dimension element a numerical index, eg Cornetto is product no
451 in the product dimension. Sales then becomes a single dictionary where
we calculate the index as product number + (region number * number of
products) + (period number * number of products * number of regions)

no 2) sounds faster, but no 1) sounds Squeakier. Does anyone have any advice
as to how best to do this?

Options 3) etc also welcome.

I daresay the correct answer is to do both and see which works best, but I
suspect there are some obvious gotchas I'm not seeing.


I'd probably start by looking at how existing multidimensional databases store their data structures, and then try to turn that into objects. Unfortunately, I don't have any experience with this.

What I would do is have a huge unsorted god-collection which has "DimensionalData" objects in it. This would be a place just to get the data into the image in the first place. Each DimensionalData object would store a list of dimension coordinates and then the actual data... somehow. So you'd have an object that would contain: (Chocolate ice cream, Region 123, October 4 at 2pm, 4 ice creams sold). Each of these would contain a point in the dimensional space.

Then I would start trying to find some way of creating "index"s (in the SQL sense) over this raw unsorted data. You could then use these indexes to do queries. Each type of query would need a particular type of index, so you'd have a lot of fun trying to write reusable code for this.

If hierarchies are used quite a lot, then I'd probably try to make a "Tree" class with a parent, children and iteration methods (cf: Collection et al).

Gulik.

--
http://people.squeakfoundation.org/person/mikevdg
http://gulik.pbwiki.com/

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
Reply | Threaded
Open this post in threaded view
|

Re: Best data structure for a multidimensional database?

Michael van der Gulik-2


On Wed, Oct 8, 2008 at 10:45 AM, Michael van der Gulik <[hidden email]> wrote:


On Wed, Oct 8, 2008 at 4:38 AM, stan shepherd <[hidden email]> wrote:

Hi, I'm looking at building a small proof of concept of a multidimensional
modelling tool in Squeak. Commercial products are things like Cognos, and
the old Express that was assimilated by Oracle.

A typical 'cube' will be 'dimensioned' by product, region, time. Each
dimension has one or more roll-ups, e.g.

                                            All Ice Creams

                                       /                      \
                             Cornetto                             Tub

                       /             |                 \
              C. Raspberry   C. Vanilla      C. Chocolate


Then sales data would be entered for the lowest level, then rolled up over
the product hierarchy, the region hierarchy, the time hierarchy.

>From there, you can ask qusetions like "what's the year on year change in
sales of Cornetto for Western Europe".

Two data structures spring to mind:

1) Use nested dictionaries for the dimensions, so that from the sales cube
we select the dictionary entry for Cornetto, then from there the entry for
Western Europe, then the two entries for this year to date and last year to
date, being actual numbers.
2) Give each dimension element a numerical index, eg Cornetto is product no
451 in the product dimension. Sales then becomes a single dictionary where
we calculate the index as product number + (region number * number of
products) + (period number * number of products * number of regions)

no 2) sounds faster, but no 1) sounds Squeakier. Does anyone have any advice
as to how best to do this?

Options 3) etc also welcome.

I daresay the correct answer is to do both and see which works best, but I
suspect there are some obvious gotchas I'm not seeing.


I'd probably start by looking at how existing multidimensional databases store their data structures, and then try to turn that into objects. Unfortunately, I don't have any experience with this.

What I would do is have a huge unsorted god-collection which has "DimensionalData" objects in it. This would be a place just to get the data into the image in the first place. Each DimensionalData object would store a list of dimension coordinates and then the actual data... somehow. So you'd have an object that would contain: (Chocolate ice cream, Region 123, October 4 at 2pm, 4 ice creams sold). Each of these would contain a point in the dimensional space.

Then I would start trying to find some way of creating "index"s (in the SQL sense) over this raw unsorted data. You could then use these indexes to do queries. Each type of query would need a particular type of index, so you'd have a lot of fun trying to write reusable code for this.

If hierarchies are used quite a lot, then I'd probably try to make a "Tree" class with a parent, children and iteration methods (cf: Collection et al).


On further thought, provided that the first index you make contains all of the data and you can iterate over it, you don't need the "god collection".

What would be nice is if an API similar to the Collection API could be used to query the db. For example:

db select: [ :each |
    (each time > n)
    and: (each time < m)
    and: (each region in: p) ]

Where "each region in: p" means "each's region is p or a sub-region of p".

The "each" object could be a special object that carefully watches the messages it receives. These messages are the clues to which indexes would be needed (which could be either created new or recycled).

The result of this query could be a "SubIndex" object, which contains start- and end- references into a larger index. This "SubIndex" object would also understand selectors such as >>do:, >>count:, etc to iterate over its entries.

Gulik.

--
http://people.squeakfoundation.org/person/mikevdg
http://gulik.pbwiki.com/

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners