Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev


Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

philippeback
---------- Forwarded message ----------
From: "[hidden email]" <[hidden email]>
Date: May 11, 2017 10:54
Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev
To: "Nick Papoylias" <[hidden email]>
Cc:



On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <[hidden email]> wrote:


On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <[hidden email]> wrote:
A. Work done
  • Downloaded the threaded VM as suggested by Esteban Lorenzano to make Iceberg work. And it does! I have successfully pushed my NeuralNetwork code to GitHub: https://github.com/olekscode/MLNeuralNetwork
  • Joined the PolyMath organization on GitHub
  • Created a repository for the TabularDataset project https://github.com/PolyMathOrg/TabularDataset as a part of PolyMath organization on GitHub
  • Fixed a PolyMath issue #25 and made a PR
  • Read an article from Wolfram Mathematica documentation regarding Dataset. It was one of the reading suggestions sent to me by Nick Papoylias
B. Next steps
  • Fix more issues of PolyMath, using Iceberg. I have to get used to it by the time the coding phase starts
  • Read the rest of Nick Papoylias's suggestions
C. Help needed
  • The Dataset in Wolfram, as well as Pandas in Python, has a very advanced indexing system. Smalltalk has its own conventions for indexing, so it would be great to get familiar with them. Could you suggest some reading on this topic (what are the indexing conventions in Smalltalk?).
    For example, in Wolfram, I can write dataset[[-1]] to extract the last row. But in Pharo indexes cannot be negative. In Pharo I would say dataset last. But how about dataset[[-5]]?
This would be a good exercise for you ;) In Pharo you can easily add negative indexing yourself. 

Hint: You know the index of the last element, since this is the size of the collection, so... ;)

No need for changes, this exists already.

Use atWrap: index put: value and atWrap: with negative indexes.
'hello' atWrap: -2

There is a specific version for Array using a primitive.
#(10 20 30 40) atWrap: -1

atWrap: 0 gives you the last item.
atWrap: -1 gives 30.

This is different from 0-based index languages such as Python, where -1 is the last item.

The interesting thing about atWrap: is that it uses modulo internally, so you do not need to care about that.

($/ split: 'abc/def/ghi/jkl') atWrap: -1 
--> 'ghi'
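
The index arithmetic behind these results can be modelled outside Pharo. The following is a minimal sketch in Python, used here only for illustration. The helper names at_wrap and at_from_end are hypothetical, and the formula (index - 1) mod size + 1 is an assumption about how atWrap: behaves on a 1-based collection, chosen because it reproduces the results quoted above; check the method source in your image. The second helper shows the Wolfram-style convention from the hint, where -1 means the last element.

```python
def at_wrap(items, index):
    """Model of atWrap: on a 1-based collection (assumed formula, see lead-in)."""
    size = len(items)
    # Python's % is non-negative for a positive size, like Smalltalk's \\
    one_based = (index - 1) % size + 1
    return items[one_based - 1]  # back to Python's 0-based indexing


def at_from_end(items, index):
    """Wolfram-style access: -1 is the last item, -5 the fifth from the end."""
    size = len(items)
    one_based = size + 1 + index if index < 0 else index
    if not 1 <= one_based <= size:
        raise IndexError(index)
    return items[one_based - 1]


numbers = [10, 20, 30, 40]
print(at_wrap(numbers, 0))                          # 40, the last item
print(at_wrap(numbers, -1))                         # 30
print(at_wrap("abc/def/ghi/jkl".split("/"), -1))    # ghi
print(at_from_end(numbers, -1))                     # 40
```

The two conventions differ by one for negative indexes, so a dataset[[-5]] equivalent needs either the size + 1 + index mapping or an atWrap: call with the index shifted by one.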

The Matrix class has a fairly rich API, but it is highly inefficient: it makes copies all the time. It would be nice to have some kind of futures or copy-on-write mechanism in there.

I miss cbind and rbind. These are useful. I have some half-baked, very inefficient implementations of them for Matrix.


The ability to name columns is also nice to have.

In R one does:

df <- data.frame(c(1, 2, 3))
df <- cbind(df, c(4, 5, 6))
names(df) <- c("C1", "C2")

The names can be retrieved with:

names(df)

A Smalltalkish style would be welcome.
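
One shape such an API could take is sketched below. It is written in Python rather than Pharo, purely as an illustration of the idea, and every name in it (Table, add_column, column_names, row_count) is hypothetical; none of them belongs to PolyMath or to the Matrix class. The sketch binds a column and names it in a single step, which is roughly what the cbind plus names combination does in R.

```python
class Table:
    """Column-oriented table where each column is bound together with its name."""

    def __init__(self):
        self._columns = {}  # name -> list of values; dicts keep insertion order

    def add_column(self, name, values):
        """Bind one named column; all columns must have the same length."""
        values = list(values)
        if name in self._columns:
            raise ValueError("duplicate column name: " + name)
        if self._columns and len(values) != self.row_count():
            raise ValueError("column length does not match the number of rows")
        self._columns[name] = values
        return self  # returning self allows chaining, loosely like a cascade

    def column_names(self):
        return list(self._columns)

    def row_count(self):
        # Length of the first column, or 0 when the table has no columns yet
        return len(next(iter(self._columns.values()), []))


df = Table().add_column("C1", [1, 2, 3]).add_column("C2", [4, 5, 6])
print(df.column_names())  # ['C1', 'C2']
print(df.row_count())     # 3
```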

Maybe looking at the Voyage queries can be helpful. 

Phil
 
 
Try adding an extension method to OrderedCollection or SequenceableCollection.

If the Pharo by Example chapter or the MOOC is not enough, read the source
itself in the core to see how basic methods are implemented (it is less scary
than it sounds).

You can also try Chapters 9, 10, 11 of the Blue Book (some API changes may apply):


  • Or what is the best way of implementing this index: dataset[["name"]] (extracts a named row), dataset[[1]] (extracts the first row)? Should I create two separate messages: dataset rowNamed: 'name' and dataset rowAt: 1?
rowNamed:
rowAt: 

Yes, it looks like it.

But if we want to model things like R data frames, for example, this has to be seen as a vectorized operation, so that you can use row slices, column slices, and logical indexes.
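
To make the three kinds of selection concrete, here is a small sketch in Python (for illustration only, not Pharo). The function names select_rows and select_columns, the 1-based inclusive slice, and the list-of-rows layout are all assumptions made for this example; they are not an existing API.

```python
def select_rows(rows, selector):
    """Select rows with a 1-based inclusive (first, last) slice or a boolean mask."""
    if isinstance(selector, tuple):          # row slice, e.g. (1, 2)
        first, last = selector
        return rows[first - 1:last]
    if len(selector) != len(rows):           # logical index: one flag per row
        raise ValueError("mask length must equal the number of rows")
    return [row for row, keep in zip(rows, selector) if keep]


def select_columns(rows, names, wanted):
    """Column slice: keep only the named columns, in the order requested."""
    positions = [names.index(name) for name in wanted]
    return [[row[p] for p in positions] for row in rows]


names = ["C1", "C2"]
rows = [[1, 4], [2, 5], [3, 6]]

print(select_rows(rows, (1, 2)))              # [[1, 4], [2, 5]]
mask = [row[0] > 1 for row in rows]           # logical index built from a condition
print(select_rows(rows, mask))                # [[2, 5], [3, 6]]
print(select_columns(rows, names, ["C2"]))    # [[4], [5], [6]]
```

The point is that a single selection message accepts a whole set of positions or flags at once, instead of one rowAt: call per row.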

Check this out: 


 
The internal representation of your data structure can be anything at the moment, as long as you encapsulate it.

(i.e. it can be nested OrderedCollections with metadata mapping column names to indexes, or a dictionary of collections, etc.)

If you don't expose it to the user (i.e. return it from the public API, or expect knowledge of it in argument passing),
we can easily change it later. So first make it work, and we optimize later ;)

For your case it will be a little bit trickier because you also have the notions of a) rows and b) columns, which
are exposed to the user. So you would need to create abstractions for these too.
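
A minimal sketch of that encapsulation, again in Python for illustration only. Dataset, DataRow, add_row, row_at and row_named are hypothetical names that mirror the proposed rowAt: and rowNamed: messages; the storage (a list of tuples plus a name-to-position dictionary) is just one possible choice and is never handed to the caller.

```python
class DataRow:
    """Row abstraction returned to users instead of the internal storage."""

    def __init__(self, column_names, values):
        self._values = dict(zip(column_names, values))

    def at(self, column_name):
        return self._values[column_name]


class Dataset:
    def __init__(self, column_names):
        self._column_names = list(column_names)
        self._rows = []        # internal detail: could be replaced later
        self._row_names = {}   # internal detail: row name -> 1-based position

    def add_row(self, values, name=None):
        values = tuple(values)
        if len(values) != len(self._column_names):
            raise ValueError("row length does not match the number of columns")
        self._rows.append(values)
        if name is not None:
            self._row_names[name] = len(self._rows)
        return self

    def row_at(self, index):
        """1-based access, mirroring the proposed rowAt: message."""
        if not 1 <= index <= len(self._rows):
            raise IndexError(index)
        return DataRow(self._column_names, self._rows[index - 1])

    def row_named(self, name):
        """Access by row name, mirroring the proposed rowNamed: message."""
        return self.row_at(self._row_names[name])


ds = Dataset(["C1", "C2"]).add_row([1, 4], name="first").add_row([2, 5])
print(ds.row_named("first").at("C2"))  # 4
print(ds.row_at(2).at("C1"))           # 2
```

Because callers only ever see DataRow objects, the list of tuples could later become a column-oriented store without changing any client code.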

Cheers,

Nick


If someone else is having problems with Iceberg on Linux, try downloading the threaded VM:
wget -O- get.pharo.org/vmT60 | bash
And use an SSH (not HTTPS) remote URL.

--
You received this message because you are subscribed to the Google Groups "Pharo Google Summer of Code" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

SergeStinckwich


Sent from my iPhone

On May 11, 2017, at 11:43, "[hidden email]" <[hidden email]> wrote:

A Smalltalkish style would be welcome.




Interesting! Are you coming to PharoDays? We can talk about that if we find time.


Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Oleksandr Zaitsev
I would love to, but to travel to Lille from my country I would need a visa, which is not that easy to obtain.
So maybe I will come to PharoDays 2018.
And I will definitely try to come to the ESUG conference in September.

Oleks


Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

SergeStinckwich
I was asking Philippe, but I hope to see you at ESUG too!

Sent from my iPhone


Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

philippeback
In reply to this post by SergeStinckwich

On Tue, May 16, 2017 at 6:26 PM, <[hidden email]> wrote:






Interesting! Are you coming to PharoDays? We can talk about that if we find time.

I am. There is a 'coding time' slot :-)

Phil 


Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

philippeback
In reply to this post by SergeStinckwich
We could also use Discord and do something "somewhat live".

Phil

On Tue, May 16, 2017 at 7:23 PM, <[hidden email]> wrote:
I was asking Philippe but hope to see you also at ESUG !

Envoyé de mon iPhone

Le 16 mai 2017 à 19:02, Oleksandr Zaytsev <[hidden email]> a écrit :

I would love to, but to go to Lille from my country I would need a visa. Which is not that easy to acquire.
So maybe I will come to PharoDays 2018.
And I will definitely try to come to ESUG Conference in September.

Oleks

On Tue, May 16, 2017 at 7:26 PM, <[hidden email]> wrote:


Envoyé de mon iPhone

Le 11 mai 2017 à 11:43, "[hidden email]" <[hidden email]> a écrit :

---------- Message transféré ----------
De : "[hidden email]" <[hidden email]>
Date : 11 mai 2017 10:54
Objet : Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev
À : "Nick Papoylias" <[hidden email]>
Cc :



On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <[hidden email]> wrote:


On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <[hidden email]> wrote:
A. Work done
  • Downloaded the threaded VM as suggested by Esteban Lorenzano to make Iceberg work. And it does! I have successfully pushed my NeuralNetwork code to GitHub: https://github.com/olekscode/MLNeuralNetwork
  • Joined the PolyMath organization on GitHub
  • Created a repository for the TabularDataset project https://github.com/PolyMathOrg/TabularDataset as a part of PolyMath organization on GitHub
  • Fixed a PolyMath issue #25 and made a PR
  • Read an article from Wolfram Mathematica documentation regarding Dataset. It was one of the reading suggestions sent to me by Nick Papoylias
B. Next steps
  • Fix more issues of PolyMath, using Iceberg. I have to get used to it by the time the coding phase starts
  • Read the rest of Nick Papoylias's suggestions
C. Help needed
  • The Dataset in Wolfram, as well as Pandas in Python, has a very advanced indexing system. Smalltalk has its own special conventions for indexing, so I think it would be great if I got familiar with them. Could you suggest some reading on this topic (what are the indexing conventions in Smalltalk?).
    For example, in Wolfram, I can write dataset[[-1]] to extract the last row. But in Pharo, indexes cannot be negative. In Pharo I would say dataset last. But how about dataset[[-5]]?
This would be a good exercise for you ;) In Pharo you can easily add negative indexing yourself. 

Hint: You know the index of the last element, since this is the size of the collection, so... ;)

No need for changes, this exists already.

Use atWrap: and atWrap:put: with negative indexes.
'hello' atWrap: -2

There is a specific version for Array using a primitive.
#( 10 20 30 40 ) atWrap: -1

atWrap: 0 gives you the last item (40).
atWrap: -1 gives 30.

This is different from languages with 0-based indexes.

The interesting thing about atWrap: is that it uses modulo internally, so you do not need to care about that.

($/ split: 'abc/def/ghi/jkl') atWrap: -1 
--> 'ghi'
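
The wrap-around arithmetic can be checked in a Playground. This is only a sketch: the block named wrapped is a hypothetical helper that mirrors the modulo arithmetic atWrap: is assumed to use, so it is worth comparing against the actual source in your image.

```smalltalk
| items wrapped |
items := #(10 20 30 40).

"Hypothetical helper: the modulo arithmetic atWrap: is assumed to perform."
wrapped := [ :index | items at: ((index - 1) \\ items size) + 1 ].

items atWrap: 0.     "last element, as stated above"
items atWrap: -1.    "second-to-last element"

"The helper and atWrap: should agree for any integer index."
(wrapped value: -1) = (items atWrap: -1).
```

Note that the offsets differ from Wolfram's by one: since atWrap: 0 answers the last element, Wolfram's dataset[[-5]] would correspond to atWrap: -4. Also, atWrap: never signals an out-of-bounds error, which may or may not be what a dataset API wants.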

The Matrix class offers a lot API-wise, but it is highly inefficient: it makes copies all the time, etc. It would be nice to have some kind of futures or copy-on-write behavior in there.

I miss cbind and rbind. These are useful. I have some half-baked, super inefficient implementations of these for Matrix.


The ability to name columns is also nice to have.

In R one does:

# cbind returns a new data frame, so the result has to be reassigned
df <- data.frame(c(1, 2, 3))
df <- cbind(df, c(4, 5, 6))
# one name per column
names(df) <- c("C1", "C2")

The names can be read back with:

names(df)

A Smalltalkish style would be welcome.
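
As one possible reading of "Smalltalkish", here is a minimal Playground sketch of named columns. It is not an existing API: it simply assumes an OrderedDictionary keyed by column name stands in for cbind plus names, and the variable names are made up for illustration.

```smalltalk
| columns names secondRow |
"Columns kept in insertion order and keyed by name (a stand-in for cbind + names)."
columns := OrderedDictionary new.
columns at: 'C1' put: #(1 2 3).
columns at: 'C2' put: #(4 5 6).

"Column names, in the order the columns were added."
names := columns keys asArray.

"A row is assembled by taking the same position from every column."
secondRow := names collect: [ :each | (columns at: each) at: 2 ].
```

Storing whole columns makes adding a column cheap and building a row more expensive; a row-oriented layout would trade the other way.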




Interesting! Are you coming to PharoDays? We can talk about that if we find time.

Maybe looking at the Voyage queries can be helpful. 

Phil
 
 
Try adding an extension method to OrderedCollection or SequenceableCollection.

If the Pharo by Example chapter or the MOOC is not enough, read the source
itself in the core to see how basic methods are implemented (it is less scary
than it sounds).
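
A rough sketch of what such an extension method could look like, following the hint that the last index equals the size of the collection. The selector atSigned: is hypothetical and not part of Pharo; in a real project the method would live in an extension protocol of your own package rather than in 'accessing'.

```smalltalk
"Evaluate this first. atSigned: is a made-up selector for this sketch."
SequenceableCollection
    compile: 'atSigned: anIndex
    "Positive indexes count from the start, negative ones from the end (-1 is the last element)."
    ^ anIndex < 0
        ifTrue: [ self at: self size + anIndex + 1 ]
        ifFalse: [ self at: anIndex ]'
    classified: 'accessing'.
```

Once the method is compiled, it can be tried like this:

```smalltalk
#(10 20 30 40 50) atSigned: -1.    "the last element"
#(10 20 30 40 50) atSigned: -5.    "the first element"
#(10 20 30 40 50) atSigned: 2.     "same as at: 2"
```

Unlike atWrap:, this version still fails on an index such as -6, because it ends up calling at: with 0.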

You can also try Chapters 9, 10, 11 of the blue book (some API changes may apply):


  • Or what is the best way of implementing this index: dataset[["name"]] (extracts a named row), dataset[[1]] (extracts the first row)? Should I create two separate messages: dataset rowNamed: 'name' and dataset rowAt: 1?
rowNamed:
rowAt: 

Yes, it looks like it.

But if we want to model things like R data frames, for example, this has to be seen as a vectorized operation, so that you can use row slices, column slices, and logical indexes.
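
To make the two ideas concrete, here is a small Playground sketch using only standard collection messages. It shows a slice and a logical (boolean mask) index on a plain Array; how a real dataset class would expose these is left open, and the variable names are illustrative.

```smalltalk
| rows slice mask picked |
rows := #(10 20 30 40 50).

"Row slice: elements 2 through 4."
slice := rows copyFrom: 2 to: 4.

"Logical index: keep the positions whose mask entry is true."
mask := #(true false true false true).
picked := ((1 to: rows size) select: [ :i | mask at: i ])
    collect: [ :i | rows at: i ].
```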

Check this out: 


 
The internal representation of your data-structure can be anything at the moment, as long as you encapsulate it.

(i.e. it can be nested OrderedCollections with metadata mapping column names to indexes, or a dictionary of collections, etc.).

If you don't expose it to the user (i.e. return it from the public API, or expect knowledge of it in argument passing),
we can easily change it later. So first make it work, and we optimize later ;)

For your case it will be a little bit trickier because you also have the notions of a) rows and b) columns, which
are exposed to the user. So you would need to create abstractions for these too.
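
A minimal sketch of that encapsulation idea, under stated assumptions: the class TabularSketch and the selectors addRow:named:, rowAt: and rowNamed: are hypothetical names chosen for illustration, and the row-oriented storage is just one of the representations mentioned above. The class-definition message is the classic one and may look different in recent Pharo versions.

```smalltalk
"Evaluate this first. All names here are made up for the sketch."
Object subclass: #TabularSketch
    instanceVariableNames: 'rows rowNames'
    classVariableNames: ''
    package: 'TabularSketch'.

(Smalltalk at: #TabularSketch) compile: 'addRow: aCollection named: aName
    "Lazily create the storage, then remember which index the name refers to."
    rows ifNil: [ rows := OrderedCollection new. rowNames := Dictionary new ].
    rows add: aCollection asArray.
    rowNames at: aName put: rows size' classified: 'adding'.

(Smalltalk at: #TabularSketch) compile: 'rowAt: anIndex
    "Answer a copy so callers never hold the internal storage."
    ^ (rows at: anIndex) copy' classified: 'accessing'.

(Smalltalk at: #TabularSketch) compile: 'rowNamed: aName
    ^ self rowAt: (rowNames at: aName)' classified: 'accessing'.
```

Once the class exists, callers only see rows, never the collections behind them:

```smalltalk
| data |
data := TabularSketch new.
data addRow: #(1 2 3) named: 'first'.
data addRow: #(4 5 6) named: 'second'.

data rowAt: 1.              "the row added first"
data rowNamed: 'second'.    "the row added second"
```

Because rowAt: and rowNamed: answer copies, the storage could later be switched to column-oriented collections without breaking callers.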

Cheers,

Nick


If someone else is having problems with Iceberg on Linux, try downloading the threaded VM:
wget -O- get.pharo.org/vmT60 | bash
And use the SSH (not HTTPS) remote URL.

--
You received this message because you are subscribed to the Google Groups "Pharo Google Summer of Code" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.





Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Stephane Ducasse-3
I'm interested in helping with such new "containers".
Maybe we should proceed this way:







Re: Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Stephane Ducasse-3
write some tests and ask for good implementations.

Crazy implementors like Henrik can probably beat us all :)

