Dear all,

To clean data before applying machine learning algorithms (such as PCA), I implemented a class that standardizes data: PMStandardizationScaler
https://github.com/PolyMathOrg/PolyMath/blob/development/src/Math-PrincipalComponentAnalysis/PMStandardizationScaler.class.st

Data can be centered and scaled. The class is similar to the one defined in scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

More complex data transformations can be implemented as subclasses of the abstract class PMDataTransformer. You have to implement the fit: and transform: methods.

A+
--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
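As a rough illustration of the fit/transform pattern described above, here is a minimal Python sketch. It is not the PolyMath implementation: the class name, the list-of-rows input format and the handling of constant columns are assumptions made for this example. It divides by n (the biased estimator), which is what scikit-learn's StandardScaler does.

```python
import math


class StandardizationScaler:
    """Hypothetical Python analogue of a fit/transform scaler (not the PolyMath code)."""

    def fit(self, rows):
        # rows: a non-empty list of equal-length lists of numbers.
        n = len(rows)
        columns = list(zip(*rows))
        self.mean = [sum(col) / n for col in columns]
        # Biased variance (divide by n), as scikit-learn's StandardScaler does.
        variances = [
            sum((x - m) ** 2 for x in col) / n
            for col, m in zip(columns, self.mean)
        ]
        # Assumption: a constant column gets scale 1.0 to avoid dividing by zero.
        self.scale = [math.sqrt(v) or 1.0 for v in variances]
        return self

    def transform(self, rows):
        # Read mean and scale once, then reuse them for every row.
        mean, scale = self.mean, self.scale
        return [
            [(x - m) / s for x, m, s in zip(row, mean, scale)]
            for row in rows
        ]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = StandardizationScaler().fit(data).transform(data)
```

After fitting, each column of `scaled` should have mean 0 and (biased) variance 1.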
Hi Serge,
i find PMStandardizationScaler an extremely useful thing, hence please excuse me making a few comments:

#transform: uses self scale & self mean repeatedly. self scale calcs the covarianceMatrix each time, which can be a lot of work; perhaps a local var would speed that up.

calculating the whole covariance matrix just for some stdevs: how much programming does that save <g>, more than half a line of code?

the covariance thing calcs, i think, variance by dividing by size, not (size - 1); is that your intention?

i like the architecture, it is simple and obvious to use. i can use it, thanks!
werner

On Tuesday, July 31, 2018 at 10:20:40 AM UTC+2, Serge Stinckwich wrote:
On Tue, Jul 31, 2018 at 2:43 PM werner kassens <[hidden email]> wrote:
OK, done in the latest version.
I don't understand what you want to say here.
This is the same pattern used in scikit-learn's StandardScaler.

It would be nice to add more transformers, like those in http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

--
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
On Tue, Jul 31, 2018 at 4:03 PM, Serge Stinckwich <[hidden email]> wrote:
On Tue, Jul 31, 2018 at 4:56 PM werner kassens <[hidden email]> wrote:
Sorry, I still don't understand. Where do you see a problem?

PMStandardizationScaler uses an iterative method to compute the mean and variance with PMCovarianceAccumulator, so we need to compute the covariance matrix in order to obtain the variance.

A+
Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
i guess it's a misunderstanding; i don't see any problem. sorry for the irritation, Serge.
werner

On Tue, Jul 31, 2018 at 6:23 PM, Serge Stinckwich <[hidden email]> wrote:
Hi Serge,

sum( (x_i - average)^2 for i=1:n ) / n is a biased estimator of variance. One must divide by n-1 to obtain an unbiased estimator. That's probably what Werner means.

Apart from the bias, I have verified the formula, and it looks correct. But it deserves an explanation (or a reference to an explanation, in Didier Besset's book?), because it is non-trivial.

delta_{n+1} is the difference (average_estimate_n - vector_{n+1}), divided by (n+1).
We want to use (average_estimate_{n+1} - vector_{n+1}) in the covariance estimator.
But then we should also compensate for the change of the average estimate in the previous accumulation...

Let's do it in scalar form first, with sums over i=1:n:

sum( (x_i - average_estimate_{n+1})^2/n ) = sum(x_i^2)/n - 2*sum(x_i)/n*average_estimate_{n+1} + sum(average_estimate_{n+1}^2)/n
... = sum(x_i^2)/n + average_estimate_{n+1}^2 - 2*average_estimate_{n+1}*average_estimate_n

Since we have computed:

variance_estimate_n = sum( (x_i - average_estimate_n)^2/n ) = sum(x_i^2)/n - average_estimate_n^2

compensating the error requires taking:

variance_estimate_n_corrected = variance_estimate_n + (average_estimate_{n+1} - average_estimate_n)^2
... = variance_estimate_n + delta_{n+1}^2

Then, updating the variance with the newly accumulated value, with the biased estimator:

variance_estimate_{n+1} = (variance_estimate_n_corrected * n + (average_estimate_{n+1} - x_{n+1})^2) / (n+1)

average_estimate_{n+1} = average_estimate_n - delta_{n+1}
average_estimate_{n+1} - x_{n+1} = average_estimate_n - x_{n+1} - delta_{n+1}
... = (n+1)*delta_{n+1} - delta_{n+1}
... = n * delta_{n+1}

So, if I did not mess up so far:

variance_estimate_{n+1} = (n * variance_estimate_n + n * delta_{n+1}^2 + n^2 * delta_{n+1}^2) / (n+1)

IOW, since n + n^2 = n*(n+1):

variance_estimate_{n+1} = variance_estimate_n * n/(n+1) + n*delta_{n+1}^2

This can be extended to covariance, and we indeed find the iterative formula which is programmed.

2018-07-31 18:33 GMT+02:00 werner kassens <[hidden email]>:
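The scalar recurrence derived above can be checked numerically. The following Python sketch is only an illustration: the function names are made up for this example and it is not the PMCovarianceAccumulator code. It applies delta = (average_n - x) / (n+1), average_{n+1} = average_n - delta and variance_{n+1} = variance_n * n/(n+1) + n * delta^2 one value at a time, so the result can be compared with the directly computed biased variance.

```python
def accumulate(count, average, variance, x):
    """One update step of the scalar recurrence (biased variance, divides by n)."""
    delta = (average - x) / (count + 1)
    new_average = average - delta
    new_variance = variance * count / (count + 1) + count * delta ** 2
    return count + 1, new_average, new_variance


def batch_biased_variance(values):
    """Reference computation: mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
count, average, variance = 0, 0.0, 0.0
for v in values:
    count, average, variance = accumulate(count, average, variance, v)

# The unbiased estimate follows by rescaling with n / (n - 1).
unbiased_variance = variance * count / (count - 1)
```

For this data set the incremental result should agree with `batch_biased_variance(values)`, and the rescaled value corresponds to dividing by n-1 instead of n.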
On Tue, Jul 31, 2018 at 10:28 PM Nicolas Cellier <[hidden email]> wrote:
Thank you for the math, Nicolas!

Actually, Collection>>stdev is not part of PolyMath but Pharo.

Collection>>stdev
	| avg sample sum |
	"In statistics, the standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
	For details about implementation see comment in self sum."
	avg := self average.
	sample := self anyOne.
	sum := self inject: sample into: [ :accum :each | accum + (each - avg) squared ].
	sum := sum - sample.
	^ (sum / (self size - 1)) sqrt

Serge Stinckwich
UMI UMMISCO 209 (SU/IRD/UY1)
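For comparison, the two conventions discussed in this thread (dividing by size - 1, as Collection>>stdev does above, versus dividing by size) can be contrasted with Python's standard statistics module. This is only an illustrative sketch on made-up data, not part of Pharo or PolyMath.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

sample_stdev = statistics.stdev(values)       # divides by n - 1 (unbiased variance)
population_stdev = statistics.pstdev(values)  # divides by n (biased variance)

# The two differ only by the factor sqrt(n / (n - 1)).
ratio_squared = (sample_stdev / population_stdev) ** 2
```

The gap between the two shrinks as n grows, so the choice matters mostly for small data sets.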
