Implementing DataFrame>>dtypes feature in Pharo PolyMath Project

balaji
Hi

 I am working on implementing the DataFrame>>dtypes feature, which checks the datatypes of the columns in a DataFrame, as part of my GSoC project. I have tried to explain my theoretical work so far in this blog post. Please kindly go through it, as I need advice on the optimal way to implement this feature. Any input and discussion is most welcome.


Thank you 

Balaji G

Re: Implementing DataFrame>>dtypes feature in Pharo PolyMath Project

balaji
Thank you, sir, for the clarification.

On Sat, Jun 19, 2021 at 7:24 PM Konrad Hinsen <[hidden email]> wrote:
Dear Balaji,

>  I am working on implementing the DataFrame>>dtypes feature which
> checks the datatypes of columns in a DataFrame, as part of my GSOC
> project. I have tried to explain my theoretical work done so far on
> this blog post. Please kindly go through it , as I need advice on what
> could be the optimal way to implement this feature. Any kind of input
> and discussions is most welcome.

Your post looks like an overall accurate description of the current
state of everything - with one exception, and that is Pandas. You say
you didn't look at the Pandas code yet, so that's not surprising.

You seem to assume that Pandas stores Python objects as elements of
DataFrames, but that isn't true. Pandas uses NumPy arrays instead. And
NumPy arrays are very different from standard Python objects, because
their internal data layout is by design the same as used in C or
Fortran. For a full description, see

  https://numpy.org/doc/stable/user/basics.rec.html
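As a miniature of that layout, Python's standard-library array module also stores homogeneous machine-typed values in a contiguous C-style buffer, and, like a NumPy column, it rejects out-of-type assignments (a simplified stand-in for NumPy, not its actual implementation):

```python
from array import array

# A fixed-type, C-layout column: 'l' = signed long, chosen at creation
col = array('l', [1, 2, 3])

col[0] = 42        # in-type assignment works
try:
    col[1] = "hello"   # wrong type: rejected, the element type is fixed
except TypeError as e:
    print("rejected:", e)
```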

However, I am not sure you need to understand this in all detail, as I
am pretty sure that you do not want to copy this approach in Pharo.

The one point that does matter for you is where NumPy and Pandas take a
column's dtype from. The answer is that it's defined when a DataFrame is
created, and it cannot be changed afterwards. If a column is "integer",
it will remain "integer" forever. If you try to assign a string to an
element of such a column, you get an error message. When you create a
DataFrame from existing data, e.g. by reading a CSV file, Pandas scans
the data and determines a suitable dtype, much in the same way as V1.0
in Pharo/PolyMath did. But since Pandas doesn't allow any later change,
there is no serious performance issue.
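The scan-and-infer step described above can be sketched in plain Python; the function name and the small type lattice here are illustrative, not pandas' actual inference rules:

```python
def infer_dtype(values):
    """Scan a column's values once and pick the narrowest common type,
    roughly what pandas' CSV reader (and V1.0 in Pharo/PolyMath) does."""
    types = {type(v) for v in values if v is not None}
    if types <= {bool}:
        return bool
    if types <= {bool, int}:
        return int
    if types <= {bool, int, float}:
        return float
    return object   # mixed or unknown falls back to a generic type

print(infer_dtype([1, 2, 3]).__name__)    # int
print(infer_dtype([1, 2.5]).__name__)     # float
print(infer_dtype([1, "a"]).__name__)     # object
```

Because the scan happens exactly once, at creation time, its O(n) cost is paid only when the DataFrame is built, never on later element access.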

So that's an option you can add to your list: define the dtypes once and
for all when the DataFrame is created. The main drawback is that you
would have to change the API for DataFrame creation to make this work.
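A minimal sketch of that option, in Python for brevity (the class and method names are hypothetical, not PolyMath API): the dtype is inferred once in the constructor and every later assignment is checked against it.

```python
class TypedColumn:
    """Sketch of a column whose dtype is fixed at creation, as in
    NumPy/pandas. Inference here is simplistic (first element's type)."""

    def __init__(self, values):
        self.dtype = type(values[0]) if values else object
        for v in values:
            self._check(v)          # reject mixed input up front
        self._values = list(values)

    def _check(self, v):
        if self.dtype is not object and not isinstance(v, self.dtype):
            raise TypeError(
                f"expected {self.dtype.__name__}, got {type(v).__name__}")

    def __setitem__(self, i, v):
        self._check(v)              # assignment cannot change the dtype
        self._values[i] = v

    def __getitem__(self, i):
        return self._values[i]

col = TypedColumn([1, 2, 3])
col[0] = 10                         # OK: still an integer
try:
    col[1] = "hello"                # rejected: dtype was fixed at creation
except TypeError as e:
    print("rejected:", e)
```

With this design, dtypes becomes a cheap lookup of each column's stored dtype, at the cost of a creation-time API that accepts or infers the types.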

Cheers,
  Konrad
--
---------------------------------------------------------------------
Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: konrad DOT hinsen AT cnrs DOT fr
http://dirac.cnrs-orleans.fr/~hinsen/
ORCID: https://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
---------------------------------------------------------------------