Add class PMStandardizationScaler


Add class PMStandardizationScaler

SergeStinckwich
Dear all,

in order to clean data before using Machine Learning algorithms (like PCA), I implemented a class to do data standardization:
PMStandardizationScaler
https://github.com/PolyMathOrg/PolyMath/blob/development/src/Math-PrincipalComponentAnalysis/PMStandardizationScaler.class.st

Data can be centered and scaled. The class is similar to the one defined in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
More complex data transformations can be implemented as subclasses of the abstract class PMDataTransformer. You have to implement the fit: and transform: methods.
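For readers who know scikit-learn, the fit:/transform: pattern can be sketched in Python. This is a hypothetical minimal standardizer for illustration only; the class and attribute names (StandardizationScaler, mean_, scale_) are mine, modelled on scikit-learn's StandardScaler, not PolyMath's actual code:

```python
import math

class StandardizationScaler:
    """Center each column to zero mean and scale it to unit standard deviation."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.mean_ = [sum(c) / len(c) for c in cols]
        # Biased (divide-by-n) standard deviation per column;
        # `or 1.0` guards against constant columns (zero variance).
        self.scale_ = [
            math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, self.mean_)
        ]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.scale_)]
            for row in rows
        ]

data = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
scaled = StandardizationScaler().fit(data).transform(data)
```

After fitting, each column of the transformed data has mean 0 and (biased) standard deviation 1; a different transformer would override only fit and transform, mirroring the PMDataTransformer contract described above.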
A+
--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
"Programs must be written for people to read, and only incidentally for machines to execute."
http://www.doesnotunderstand.org/

--
You received this message because you are subscribed to the Google Groups "PolyMath" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Add class PMStandardizationScaler

werner kassens-2
Hi Serge,
I find PMStandardizationScaler an extremely useful thing, so please excuse me making a few comments:
#transform: uses self scale and self mean repeatedly. self scale computes the covarianceMatrix each time, which can be a lot of work; perhaps a local variable would speed that up.
Calculating the whole covariance matrix just for some standard deviations: how much programming does that save <g>, more than half a line of code?
The covariance code computes, I think, the variance by dividing by size rather than (size - 1). Is that your intention?
I like the architecture; it is simple and obvious to use.
I can use it, thanks!
werner


Re: Add class PMStandardizationScaler

SergeStinckwich


On Tue, Jul 31, 2018 at 2:43 PM werner kassens <[hidden email]> wrote:
Hi Serge,
I find PMStandardizationScaler an extremely useful thing, so please excuse me making a few comments:
#transform: uses self scale and self mean repeatedly. self scale computes the covarianceMatrix each time, which can be a lot of work; perhaps a local variable would speed that up.

OK, done in the last version.
Calculating the whole covariance matrix just for some standard deviations: how much programming does that save <g>, more than half a line of code?
The covariance code computes, I think, the variance by dividing by size rather than (size - 1). Is that your intention?

I don't understand what you want to say here.
I like the architecture; it is simple and obvious to use.
I can use it, thanks!

This is the same pattern used in scikit-learn's StandardScaler:
--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
"Programs must be written for people to read, and only incidentally for machines to execute."
http://www.doesnotunderstand.org/


Re: Add class PMStandardizationScaler

werner kassens-2
On Tue, Jul 31, 2018 at 4:03 PM, Serge Stinckwich <[hidden email]> wrote:
Calculating the whole covariance matrix just for some standard deviations: how much programming does that save <g>, more than half a line of code?
The covariance code computes, I think, the variance by dividing by size rather than (size - 1). Is that your intention?

I don't understand what you want to say here.

Collection>>stdev
    | avg sample sum |
    avg := self average.
    "see comment in self sum"
    sample := self anyOne.
    sum := self inject: sample into: [:accum :each | accum + (each - avg) squared].
    sum := sum - sample.
    ^ (sum / (self size - 1)) sqrt

The method you use probably does:
^ (sum / self size) sqrt

werner
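The difference Werner is pointing at can be checked with a quick Python calculation (illustrative numbers of my own; dividing by n gives the biased "population" estimator, dividing by n - 1 the unbiased "sample" one):

```python
import math

# Illustrating biased (divide by n) vs unbiased (divide by n - 1) variance.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n                     # 5.0
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations: 32.0

biased = ss / n          # 4.0   -> what a divide-by-size covariance gives
unbiased = ss / (n - 1)  # ~4.571 -> what Pharo's Collection>>stdev squares to

print(math.sqrt(biased), math.sqrt(unbiased))  # stdev: 2.0 vs ~2.138
```

For large n the two converge, but for small samples the divide-by-n estimate systematically understates the variance.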


Re: Add class PMStandardizationScaler

SergeStinckwich


On Tue, Jul 31, 2018 at 4:56 PM werner kassens <[hidden email]> wrote:
On Tue, Jul 31, 2018 at 4:03 PM, Serge Stinckwich <[hidden email]> wrote:
Calculating the whole covariance matrix just for some standard deviations: how much programming does that save <g>, more than half a line of code?
The covariance code computes, I think, the variance by dividing by size rather than (size - 1). Is that your intention?

I don't understand what you want to say here.

Collection>>stdev
    | avg sample sum |
    avg := self average.
    "see comment in self sum"
    sample := self anyOne.
    sum := self inject: sample into: [:accum :each | accum + (each - avg) squared].
    sum := sum - sample.
    ^ (sum / (self size - 1)) sqrt

The method you use probably does:
^ (sum / self size) sqrt


Sorry, I still don't understand. Where do you see a problem?
PMStandardizationScaler uses an iterative way to compute the mean and variance with PMCovarianceAccumulator.
So we need to compute the covariance matrix in order to compute the variance.

A+
--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
"Programs must be written for people to read, and only incidentally for machines to execute."
http://www.doesnotunderstand.org/


Re: Add class PMStandardizationScaler

werner kassens-2
I guess it's a misunderstanding; I don't see a problem. Sorry for the irritation, Serge.
werner


Re: Add class PMStandardizationScaler

Nicolas Cellier
Hi Serge,

sum( (x_i - average)^2 for i=1:n ) / n is a biased estimator of the variance.
One must divide by n-1 to obtain an unbiased estimator.
That's probably what Werner means.

Apart from the bias, I have verified the formula, and it seems correct.
But it would deserve an explanation (or a reference to an explanation, in Didier Besset's book?), because it is non-trivial.

delta_{n+1} is the difference (average_estimate_n - x_{n+1}) divided by (n+1), so that average_estimate_{n+1} = average_estimate_n - delta_{n+1}.
We want to use (average_estimate_{n+1} - x_{n+1}) in the covariance estimator.
But then we should also compensate for the evolution of the average estimate in the previous accumulation...

Let's do it in the scalar case first:

sum( (x_i - average_estimate_{n+1})^2 )/n = sum( x_i^2 )/n - 2*(sum( x_i )/n)*average_estimate_{n+1} + average_estimate_{n+1}^2
... = sum( x_i^2 )/n + average_estimate_{n+1}^2 - 2*average_estimate_{n+1}*average_estimate_n

Since we have computed:
variance_estimate_n = sum( (x_i - average_estimate_n)^2 )/n = sum( x_i^2 )/n - average_estimate_n^2

compensating the error requires taking:
variance_estimate_n_corrected = variance_estimate_n + (average_estimate_{n+1} - average_estimate_n)^2
... = variance_estimate_n + delta_{n+1}^2

Then, updating the variance with the newly accumulated value, with the biased estimator:
variance_estimate_{n+1} = (variance_estimate_n_corrected * n + (average_estimate_{n+1} - x_{n+1})^2) / (n+1)

average_estimate_{n+1} = average_estimate_n - delta_{n+1}
average_estimate_{n+1} - x_{n+1} = average_estimate_n - x_{n+1} - delta_{n+1}
... = (n+1)*delta_{n+1} - delta_{n+1}
... = n * delta_{n+1}

So, if I did not mess up so far:
variance_estimate_{n+1} = (n * variance_estimate_n + n * delta_{n+1}^2 + n^2 * delta_{n+1}^2) / (n+1)

IOW:
variance_estimate_{n+1} = variance_estimate_n * n/(n+1) + n * delta_{n+1}^2

This can be extended to covariance, and we indeed find the iterative formula that is programmed.
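The derivation above can be checked numerically. Here is a minimal Python transcription of the scalar case (class and variable names are mine, not PolyMath's; delta carries the 1/(n+1) factor as in the derivation, and PMCovarianceAccumulator does the matrix analogue):

```python
# Scalar version of the iterative mean/variance update:
#   delta_{n+1} = (mean_n - x_{n+1}) / (n + 1)
#   mean_{n+1}  = mean_n - delta_{n+1}
#   var_{n+1}   = var_n * n/(n+1) + n * delta_{n+1}^2   (biased estimator)

class VarianceAccumulator:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var = 0.0  # biased (divide-by-n) variance estimate

    def accumulate(self, x):
        delta = (self.mean - x) / (self.n + 1)
        self.var = self.var * self.n / (self.n + 1) + self.n * delta ** 2
        self.mean -= delta
        self.n += 1

acc = VarianceAccumulator()
for x in [1.0, 3.0, 5.0, 11.0]:
    acc.accumulate(x)
# Batch check: mean = 5.0 and biased variance = sum((x - 5)^2)/4 = 56/4 = 14.0
```

Accumulating [1.0, 3.0, 5.0, 11.0] reproduces the batch mean 5.0 and biased variance 14.0, without ever storing the data.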




Re: Add class PMStandardizationScaler

SergeStinckwich


On Tue, Jul 31, 2018 at 10:28 PM Nicolas Cellier <[hidden email]> wrote:

Thank you for the math, Nicolas!
Actually, Collection>>stdev is not part of PolyMath but of Pharo:

Collection>>stdev
    | avg sample sum |
    "In statistics, the standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
    For details about implementation see comment in self sum."
    avg := self average.
    sample := self anyOne.
    sum := self inject: sample into: [ :accum :each | accum + (each - avg) squared ].
    sum := sum - sample.
    ^ (sum / (self size - 1)) sqrt

--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
"Programs must be written for people to read, and only incidentally for machines to execute."
http://www.doesnotunderstand.org/
