importing a lot of instance data from a text file


importing a lot of instance data from a text file

ccrraaiigg

Hi all--

     Apologies for the very basic question...

     A client has given me the contents of an old database as a large
CSV file (one million records), to import into a GemStone/S database.
When I do the naive thing of:

-    defining a class in the GemStone database that describes the
     instances in the CSV file, and

-    writing and running a little CSV importer on the class side of
     that class,

GemStone runs out of memory. How should I import this data? Is there
some special way to structure the code that does the import, or should I
just increase the amount of memory available somehow?


     thanks!

-C

--
Craig Latta
www.netjam.org/resume
+31   6 2757 7177
+ 1 415  287 3547 (no SMS)

Re: importing a lot of instance data from a text file

James Foster-8
Hi Craig,

The out-of-memory situation is because the data is being held as "temporary" objects in the gem, not as persistent objects in the repository. If this data came in through the normal operation of your application, it would have involved many transactions, each of which persisted some subset of the total. The import process can be the same. Create a root collection, start a loop to import the data, and commit every thousand records. After a thousand commits you will have all the data loaded.
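
A minimal sketch of that loop in GemStone Smalltalk, not a definitive implementation. It assumes the file is readable on the server; `Customer`, its class-side `fromCsvLine:` parser, the `#Customers` key, and the file path are hypothetical names used only for illustration.

```smalltalk
| file root count line |
"Root collection: reachable from UserGlobals, so it persists on commit.
 #Customers is a hypothetical key."
root := OrderedCollection new.
UserGlobals at: #Customers put: root.
System commitTransaction.

file := GsFile openReadOnServer: '/path/to/bigCSVFile'.   "hypothetical path"
file isNil ifTrue: [^self error: 'could not open file'].
count := 0.
[file atEnd] whileFalse: [
    line := file nextLine.
    root add: (Customer fromCsvLine: line).   "hypothetical class and parser"
    count := count + 1.
    "Commit every thousand records so the new objects become persistent."
    count \\ 1000 = 0 ifTrue: [
        System commitTransaction
            ifFalse: [^self error: 'commit failed at record ', count printString]]].
file close.
"Final commit picks up any records left over after the last full batch."
System commitTransaction.
```

The batch size of 1000 follows the suggestion above and can be tuned; the point is only that no single transaction has to hold all one million new objects as temporaries.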

James

On Sep 29, 2012, at 12:24 PM, Craig Latta wrote:

> GemStone runs out of memory. How should I import this data? Is there
> some special way to structure the code that does the import, or should I
> just increase the amount of memory available somehow?

Re: importing a lot of instance data from a text file

ccrraaiigg

     Thanks, James. Hm, now my problem seems to be that the following
results in control never returning:

***

| file |

file := GsFile openRead: 'bigCSVFile'.
file upTo: Character lf.

***

The file exists, I can invoke #upTo: with normal letters, and the file
does contain linefeeds (I can see them if I just grab characters out
with #next:).

     Help?


     thanks again!

-C


Re: importing a lot of instance data from a text file

ccrraaiigg
In reply to this post by James Foster-8

     Also, the GemStone inspectors seem a little weird, complaining that
various things are unbound (like temporaries and self!). See attached
debugger picture...


-C


Attachment: Parallels DesktopScreenSnapz002.png (43K)

Re: importing a lot of instance data from a text file

Dale Henrichs
In reply to this post by ccrraaiigg
Craig,

First off, GsFile class>>openRead: reads files from the `client` system.

The `client` system is the machine on which the GCI process is running (i.e., if GemTools is running on your Mac and the stone is running on a Linux VM hosted on your Mac, the `client` system is your Mac and the `server` system is the Linux VM).

It is more efficient (especially for large files) to work with files directly on the server (the efficiency is related to how we have implemented the handling of `client` operations).

So you should move the files to the server, use GsFile class>>openReadOnServer:, and try again. Operating on `client` files can be very slow even if the GCI is running on the same machine as the stone ...
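
As a rough sketch of the server-side variant of the snippet from the earlier message (the path is hypothetical, and this assumes the file lives on the machine where the gem runs):

```smalltalk
| file header |
"Open on the server (the gem's host) rather than through the GCI client."
file := GsFile openReadOnServer: '/path/on/server/bigCSVFile'.   "hypothetical path"
file isNil
    ifTrue: [GsFile serverErrorString]   "describes why the open failed"
    ifFalse: [
        header := file upTo: Character lf.   "first line, without the linefeed"
        file close.
        header]
```

Evaluating this returns either the first line of the file or the server's error string, which makes it a quick check that the file is visible to the gem before running a full import.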

Secondly, if you take a look at GsFile>>upTo: it is implemented as a loop of #next calls (I'm looking at 3.1.0.1 in case it matters):


        | result |
        result := String new.
        self positionA to: self fileSize - 1 do: [:i |
                | char |
                (char := self next) = anObject ifTrue: [^result].
                result add: char.
        ].
        ^result.

I'd be tempted to have you step into the method and see which of the calls is running slowly, but using #openReadOnServer: just might be the ticket :)

Dale
----- Original Message -----
| From: "Craig Latta" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Sunday, September 30, 2012 8:03:17 AM
| Subject: Re: [GS/SS Beta] importing a lot of instance data from a text file
|
|
|      Thanks, James. Hm, now my problem seems to be that the following
| results in control never returning:
|
| ***
|
| | file |
|
| file := GsFile openRead: 'bigCSVFile'.
| file upTo: Character lf.
|
| ***

Re: importing a lot of instance data from a text file

Dale Henrichs
In reply to this post by ccrraaiigg
Craig,

I assume that you're getting the 'undefined symbol' error when you select a line in the debugger and try to printit in the debugger? If that's the case, then you are hitting a bug ...

Dale

----- Original Message -----
| From: "Craig Latta" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Sunday, September 30, 2012 8:37:55 AM
| Subject: Re: [GS/SS Beta] importing a lot of instance data from a text file
|
|
|      Also, the GemStone inspectors seem a little weird, complaining
|      that
| various things are unbound (like temporaries and self!). See attached
| debugger picture...

Re: self unbound in inspector panes

ccrraaiigg

> I assume that you're getting the 'undefined symbol' error when you
> select a line in the debugger and try to printit in the debugger? If
> that's the case, then you are hitting a bug ...

     Okay. It happens with "normal" inspectors too, like the attached
inspector on a String.


-C


Attachment: Parallels DesktopScreenSnapz003.png (42K)

Re: importing a lot of instance data from a text file

ccrraaiigg
In reply to this post by Dale Henrichs

Hi Dale--

> ...you should move the files to the server and use GsFile
> class>>openReadOnServer: and try again. Operating on `client` files
> can be very slow even if the gci is running on the same machine as
> the stone ...

     Aha. Well, I had already moved the file to the server, and I'm
running GemTools on the same machine.

> Secondly, if you take a look at GsFile>>upTo: it is implemented as a
> loop of #next calls (I'm looking at 3.1.0.1 in case it matters)...

     (I'm using 3.1.0.1 too.)

> I'd be tempted to have you step into the method and see which of the
> calls is running slow, but using #openReadOnServer: just might be the
> ticket:)

     Somehow using #openReadOnServer: makes a difference, but I have no
idea how. :)  When I was using #openRead:, it was reading alphabetic
characters from the file with no problem; it was only when it came to
the "upTo: Character lf" that it went off into outer space. Hm.
Definitely some bugs floating around, the inability to evaluate
expressions in the debugger being the most serious.


     Well, thanks again!

-C


Re: self unbound in inspector panes

Dale Henrichs
In reply to this post by ccrraaiigg
Hmmm... the debugger might be explainable, but I'm surprised that `self` is unbound in the inspector ... mostly I use the chasing inspector (`explore`), so maybe there's something more going on?

...ah, I think that SmallIntegers, Strings and a few of the other objects that get turned into native objects on the Pharo-side end up with `self` being unbound ...

If you inspect a more complex object like an Association, `self` is bound correctly ...

There wasn't enough stack visible in your debugger for me to tell exactly what might be going on, but `self` is unbound only in "special case methods" ...

Dale
----- Original Message -----
| From: "Craig Latta" <[hidden email]>
| To: "Dale Henrichs" <[hidden email]>
| Cc: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Sunday, September 30, 2012 1:04:41 PM
| Subject: re: self unbound in inspector panes
|
|
| > I assume that you're getting the 'undefined symbol' error when you
| > select a line in the debugger and try to printit in the debugger?
| > If
| > that's the case, then you are hitting a bug ...
|
|      Okay. It happens with "normal" inspectors too, like the attached
| inspector on a String.

Re: importing a lot of instance data from a text file

Dale Henrichs
In reply to this post by ccrraaiigg


----- Original Message -----
| From: "Craig Latta" <[hidden email]>
| To: "Dale Henrichs" <[hidden email]>
| Cc: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Sunday, September 30, 2012 1:45:08 PM
| Subject: Re: [GS/SS Beta] importing a lot of instance data from a text file
|
|
| Hi Dale--
|
| > ...you should move the files to the server and use GsFile
| > class>>openReadOnServer: and try again. Operating on `client` files
| > can be very slow even if the gci is running on the same machine as
| > the stone ...
|
|      Aha. Well, I had already moved the file to the server, and I'm
| running GemTools on the same machine.
|
| > Secondly, if you take a look at GsFile>>upTo: it is implemented as
| > a
| > loop of #next calls (I'm looking at 3.1.0.1 in case it matters)...
|
|      (I'm using 3.1.0.1 too.)
|
| > I'd be tempted to have you step into the method and see which of
| > the
| > calls is running slow, but using #openReadOnServer: just might be
| > the
| > ticket:)
|
|      Somehow using #openReadOnServer: makes a difference, but I have
|      no
| idea how. :)

When you open the file on the server, you cut the client (and the GCI round trips) out of the equation ... in my experience, working with large files over GCI can be very slow ...

| When I was using #openRead:, it was reading alphabetic
| characters from the file with no problem; it was only when it came to
| the "upTo: Character lf" that it went off into outer space. Hm.
| Definitely some bugs floating around, the inability to evaluate
| expressions in the debugger being the most serious.

As I mentioned in my other email ... `self` is not consistently unbound ... if I could see more of your stack, I could probably figure out the problem ...


Re: importing a lot of instance data from a text file

ccrraaiigg

> When you open on the server you cut the client (and the gci round
> trips) out of the equation ... my experience is that working with
> large files can be very slow over gci ...

     Oh, I mean, the mystery is that with the client version I could
read alphabetic characters but not linefeeds, and with the server
version I could read everything.

> If I could see more of your stack, I could probably figure out the
> problem...

     Sure, I could make an account for you and you could play around
with it yourself?


     thanks again,

-C


Re: importing a lot of instance data from a text file

Dale Henrichs


----- Original Message -----
| From: "Craig Latta" <[hidden email]>
| To: "Dale Henrichs" <[hidden email]>
| Cc: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Sunday, September 30, 2012 11:14:26 PM
| Subject: Re: [GS/SS Beta] importing a lot of instance data from a text file
|
|
| > When you open on the server you cut the client (and the gci round
| > trips) out of the equation ... my experience is that working with
| > large files can be very slow over gci ...
|
|      Oh, I mean, the mystery is that with the client version I could
| read alphabetic characters but not linefeeds, and with the server
| version I could read everything.

You mean the #upTo: call was slow, or that when you did a #next and the result was lf, the response was slow?

|
| > If I could see more of your stack, I could probably figure out the
| > problem...
|
|      Sure, I could make an account for you and you could play around
| with it yourself?

Actually, a listing of the stack itself with the method/context in question highlighted would be enough for a starting point:)