Pre-Getting started info: Unicode, utf8, large memory need


Charles Hixson-2
Squeak looks interesting, but before getting started I need 3 pieces of
info:
1)  How does one read and write utf8 files?
2)  Can strings be indexed by chars, even if they are unicode rather
than ascii?
3)  What happens if you have more data than will fit into RAM?

(For 3 "use a database" is an acceptable answer, but I'm hoping for
something involving automatic paging.)
(For 2 "use 4 bytes/char" is acceptable, but only if there's a good
answer to 3.)

I thought I could just look this up in the documentation, but it doesn't
seem to address these points...at least not until you start being able
to read the code in the browser fluently.
These questions also don't seem to have been addressed in the mailing
list before.  (At least a search didn't find them.)

An additional, but much less urgent, question is "How does one use
Squeak on multiple cores of a multi-core processor?"
_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners

Re: Pre-Getting started info: Unicode, utf8, large memory need

Herbert König
Hi Charles,

CH> 1)  How does one read&write utf8 files?

The FileStream class has used UTF8 files by default for some time. So
if you want non-UTF8 files (e.g. for speed reasons), you have to take
extra measures.

CH> 2)  Can strings by indexed by chars, even if they are unicode rather
CH> than ascii?

Yes. See class WideString.
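
For comparison only, here is a small Python sketch (not Squeak code) of what "indexing by chars" means once text is not plain ascii. It contrasts indexing by code point with the two byte-level layouts mentioned in the question: variable-width UTF-8 and fixed-width "4 bytes/char". The sample string and the helper names are made up for illustration.

```python
# Sample text with three non-ascii code points (i-diaeresis, e-acute, Omega).
TEXT = "na\u00efve caf\u00e9 \u03a9"


def char_at(s, i):
    """Index a Python str by code point, not by byte."""
    return s[i]


def char_at_utf32(buf, i):
    """Index a fixed-width buffer: every code point occupies exactly 4 bytes."""
    return buf[4 * i:4 * i + 4].decode("utf-32-le")


UTF8 = TEXT.encode("utf-8")        # 1 to 4 bytes per code point
UTF32 = TEXT.encode("utf-32-le")   # always 4 bytes per code point
```

With a fixed-width layout the n-th character is found by arithmetic alone, which is the trade-off behind "use 4 bytes/char": constant-time indexing in exchange for memory.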

CH> 3)  What happens if you have more data than will fit into RAM?

Use a database :-))

CH> (For 3 "use a database" is an acceptable answer, but I'm hoping for
CH> something involving automatic paging.)

There are object databases like Magma which make this less painful,
and there are O/R mappers. Commercial products such as GemStone handle
images bigger than RAM. I thought GemStone had a free version, but I
can't find it on their website.

CH> An additional, but much less urgent, question is "How does one use
CH> Squeak on multiple cores of a multi-core processor?"

There is an experimental "Hydra VM" which can run multiple images,
each in its own native thread. Squeak itself runs in a single OS
thread and uses green threads inside.

You might tell us what you want to achieve. Personally I'd say start
small :-)


--
Cheers,

Herbert  


Re: Pre-Getting started info: Unicode, utf8, large memory need

Charles Hixson-2
On 04/27/2010 09:10 PM, Herbert König wrote:

> You might tell us, what you want to achieve. Personally I'd say start
> small :-)
>    

Well, I am starting small, but the database isn't all that small.  I'm
planning, as a first step, to build a bibliographic database of
"interesting books" from Project Gutenberg.  They often leave out things
that I want to include in my bibliography, like "When was this first
published?"  (Sometimes it isn't known.)  I also want to include things
like a story index and an author index for publications (e.g. magazines)
that have multiple stories with multiple authors.  Some of this I've
already done by hand, but unfortunately I used two different formats,
and the info also needs to be relocated to the end of the file.  (I'm
planning a table just prior to the "</body>" tag.)

The next step is to generate catalogs from this bibliographic
information.  Then I want to package them together with the files onto
something that will fit onto a DVD by the middle of November.  (That
should be practical.)

The next step is to build indexes of names and where they appear.  Etc.
(I don't have the details planned out.  Automated information retrieval
is the goal, but not just free-form retrieval, and I don't know exactly
what I'll need to do. It's likely to require pre-computing a lot of
partial answers, though.)

I looked at Magma, and couldn't figure out whether it would be useful or
not.  I've no idea just how fast it is, how capacious it is, or how much
RAM it consumes, and I don't even know what I should measure.  It's the
kind of thing that could look like it was working fine until one
suddenly passed some critical usage level, and then it would just barely
work at all, and I can't guess how one could determine that usage level
ahead of time.  And I want locally separate files, so I guess I'd
probably use Sqlite or Firebird.  With Sqlite I might need to have
multiple databases to handle the final system, so it would probably be
best to partition things early.  (Either that or build some sort of
hierarchical storage system that rolled things from database to database
depending on how recently they were accessed.)
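
A rough Python sketch of the "roll things from database to database" idea, using SQLite's ATTACH DATABASE so that one connection can see two separate files. The table name `book`, the column `last_access`, and the file names are hypothetical, chosen only for this illustration. Nothing here measures whether the approach is fast enough.

```python
import os
import sqlite3
import tempfile

SCHEMA = "(id INTEGER PRIMARY KEY, title TEXT, last_access INTEGER)"


def open_partitioned(hot_path, cold_path):
    """Open the 'hot' file and attach the 'cold' file under the name archive."""
    conn = sqlite3.connect(hot_path)
    conn.execute("ATTACH DATABASE ? AS archive", (cold_path,))
    conn.execute("CREATE TABLE IF NOT EXISTS main.book " + SCHEMA)
    conn.execute("CREATE TABLE IF NOT EXISTS archive.book " + SCHEMA)
    return conn


def roll_out(conn, cutoff):
    """Move rows last accessed before cutoff from the hot file to the cold one."""
    with conn:  # one transaction: copy, then delete
        conn.execute(
            "INSERT INTO archive.book SELECT * FROM main.book "
            "WHERE last_access < ?", (cutoff,))
        cur = conn.execute(
            "DELETE FROM main.book WHERE last_access < ?", (cutoff,))
    return cur.rowcount


def demo():
    """Insert three rows, roll out the two oldest, report the row counts."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_partitioned(os.path.join(tmp, "hot.db"),
                                os.path.join(tmp, "cold.db"))
        try:
            with conn:
                conn.executemany(
                    "INSERT INTO main.book (title, last_access) VALUES (?, ?)",
                    [("A", 10), ("B", 20), ("C", 30)])
            moved = roll_out(conn, cutoff=25)
            hot = conn.execute("SELECT COUNT(*) FROM main.book").fetchone()[0]
            cold = conn.execute("SELECT COUNT(*) FROM archive.book").fetchone()[0]
            return moved, hot, cold
        finally:
            conn.close()
```

Each partition stays an ordinary standalone file, which is the property that makes this attractive compared with a system-level database server.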

I'm guessing that FileStream would handle file BOM markers gracefully.  
(Most of my files are utf8 with BOM markers at the head.)  This isn't
totally standard, as many utf8 files don't have any markers to show that
they aren't ascii (or extended ascii), but it's ONE of the standard
approaches.
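
To make the BOM question concrete, here is a hedged Python sketch (again, not Squeak): the `utf-8-sig` codec reads UTF-8 whether or not the three-byte marker is present and drops it if it is. Whether Squeak's FileStream behaves the same way is exactly what still needs to be tested. The file names and the sample text are invented.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Report whether the file starts with the 3-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def read_utf8(path):
    """Decode a UTF-8 file; 'utf-8-sig' drops a leading BOM if present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def demo():
    """Write one file with a BOM and one without, then read both back."""
    body = "\u00c9mile Zola".encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        marked = os.path.join(tmp, "marked.txt")
        plain = os.path.join(tmp, "plain.txt")
        with open(marked, "wb") as f:
            f.write(codecs.BOM_UTF8 + body)
        with open(plain, "wb") as f:
            f.write(body)
        return (has_utf8_bom(marked), has_utf8_bom(plain),
                read_utf8(marked), read_utf8(plain))
```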

(I wouldn't need any fancy mapper.  If I weren't dealing with LOTS of
variable length arrays of variable length strings, I could just fit the
data into a simple C struct without any pointers whatsoever.  So all I
need is to be able to save a list of lists of chars, plus a few integers
that would all fit comfortably into 32 bits.  [Many of them would fit
into 8 bits.])
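
As a sketch of how little machinery that needs, here is one possible pointer-free layout in Python: every variable-length piece is preceded by a 32-bit count. The layout, the function names, and the sample record are assumptions made for illustration, not a format anyone in this thread has proposed.

```python
import struct


def pack_record(numbers, fields):
    """Serialize a few 32-bit ints plus a list of lists of strings."""
    out = bytearray()
    out += struct.pack("<I", len(numbers))
    out += struct.pack("<%di" % len(numbers), *numbers)
    out += struct.pack("<I", len(fields))
    for group in fields:
        out += struct.pack("<I", len(group))
        for text in group:
            data = text.encode("utf-8")
            out += struct.pack("<I", len(data))  # byte length, then the bytes
            out += data
    return bytes(out)


def unpack_record(buf):
    """Inverse of pack_record; walks the buffer front to back."""
    pos = 0

    def read_u32():
        nonlocal pos
        (value,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        return value

    count = read_u32()
    numbers = list(struct.unpack_from("<%di" % count, buf, pos))
    pos += 4 * count
    fields = []
    for _ in range(read_u32()):
        group = []
        for _ in range(read_u32()):
            size = read_u32()
            group.append(buf[pos:pos + size].decode("utf-8"))
            pos += size
        fields.append(group)
    return numbers, fields
```

A blob like this could be stored in a single database column or written straight to a file. The cost is that nothing inside it can be queried without unpacking.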

So far I'm still choosing the language.  I've got one routine
implemented in D, Python, Ruby, and Java so far.  Those could all be
made to work.  I'm currently working on a Vala implementation, and I'm
considering a Smalltalk one.  If D had the libraries for later use, it
would be the clear winner so far.  Unfortunately, I also have to
consider the later stages, and D doesn't have much in the way of
concurrency handling.  I'm not sure that Hydra counts...though it sounds
like I need to look into it.  The question would be how programs running
on separate virtual machines communicate with each other.  (N.B.:  Ruby
and Python also have this problem.  Vala appears to have solved it.)

I also considered "go", but it appears to be too beta at the moment.  The
design of the language poses unique requirements on the documentation
that they don't seem to be addressing.  (It could be because the
language is still in an early stage of development.)

Long term goal (1-4 decades):  A librarian program that can dig the
answers to "reasonable" questions out of the books that it handles.  And
can also recommend books in answer to slightly less reasonable questions.


Re[2]: [Newbies] Pre-Getting started info: Unicode, utf8, large memory need

Herbert König
Hi Charles,

It seems you are on top of things, so just a few remarks. My experience
is from Squeak 3.8, so you should check whether what I say holds true
for current Squeak.

Check out the UTF8 speed. I combine tab delimited files from disparate
sources into more complex objects and write out new files. First thing
was to change to non UTF8 for speed reasons. Seems you can't do this.

CH> I looked at Magma, and couldn't figure out whether it would be useful or
CH> not.  I've no idea just how fast it is, how capacious it is, or how much

Chris Muller is on Squeak dev and I'm sure he will be able to tell you
if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses
Magma in a commercial project (last time I looked).

CH> ahead of time.  And I want locally separate files, so I guess I'd
CH> probably use sqlite or Firebird.  With Sqlite I might need to have
CH> multiple databases to handle the final system, so it would probably be
CH> best to partition things early.  (Either that or build some sort of
CH> hierarchical storage system that rolled things from database to database
CH> depending of how recently it was accessed.)

SqueakDbx (or openDbx in other languages) might be of interest. I use
MySQL from Squeak in a commercial setting, no problems.

CH> I'm guessing that FileStream would handle file BOM markers gracefully.
CH> (Most of my files are utf8 with BOM markers at the head.)  This isn't

Just try it to be sure..

CH> (I wouldn't need any fancy mapper.  If I weren't dealing with LOTS of
CH> variable length arrays of variable length strings, I could just fit the
CH> data into a simple C struct without any pointers whatsoever.  So all I
CH> need is to be able to save a list of lists of chars, plus a few integers
CH> that would all fit comfortably into 32 bits.  [Many of them would fit
CH> into 8 bits.])

CouchDB has caught my attention for inhomogeneous data, scalability,
replication. But then I consider javascript a nice functional language
and I like JSON (available in Squeak). At least look at the map-reduce
algorithm, which lets you utilize multiple cores or multiple boxes,
whatever language you choose.
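
A minimal map-reduce sketch in Python, assuming a toy task of counting capitalised words as stand-ins for names. The map step handles each document independently and the reduce step merges the partial counts. The function names and the capitalisation heuristic are invented for this example.

```python
from collections import Counter
from functools import reduce


def count_names(text):
    """Map step: count capitalised words in one document."""
    words = (w.strip(".,;:!?") for w in text.split())
    return Counter(w for w in words if w[:1].isupper())


def merge(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


def map_reduce(documents, mapper=map):
    """Run the map step through `mapper`, then fold the partial results."""
    return reduce(merge, mapper(count_names, documents), Counter())
```

Because the map step shares no state, the built-in `map` can be swapped for the `map` method of a `multiprocessing.Pool` to spread documents over several cores without touching the rest of the code.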

CH> later, and D doesn't have much in the way of concurrency handling.  I'm
CH> not sure that Hydra counts...though it sounds like I need to look into
CH> it.  The question would be how to programs running on separate virtual
CH> machines communicate with each other.

These are two different issues. Hydra addresses a single machine and
does not support current Squeak (see the recent discussion on Squeak
dev). The other issue is communicating via the network. This is where
you'll end up.


--
Cheers,

Herbert  


Re[2]: [Newbies] Pre-Getting started info: Unicode, utf8, large memory need

Herbert König
In reply to this post by Charles Hixson-2
Hi Charles,

>>
>> CH>  1)  How does one read&write utf8 files?

We got off topic here (my fault), so if you're interested, let's split
this into a private discussion and an on-topic list discussion.


--
Cheers,

Herbert  


Re: Pre-Getting started info: Unicode, utf8, large memory need

Charles Hixson-2
In reply to this post by Herbert König
On 04/28/2010 09:31 PM, Herbert König wrote:

> Hi Charles,
>
> seems you are on top of things. So just a few remarks. My experience
> is from Squeak 3.8 so you should check if what I say holds true for
> current Squeak.
>
> Check out the UTF8 speed. I combine tab delimited files from disparate
> sources into more complex objects and write out new files. First thing
> was to change to non UTF8 for speed reasons. Seems you can't do this.
>    
I'm not worried about speed for this first part, and for the follow-up
I'm more worried about computational speed than utf8 reading speed.  If
I can't depend on virtual memory and automatic roll-in/out (nobody seems
to offer that!) then it means LOTS of database interaction.  Which is
where I get worried about Magma...as apparently it holds a partial
reference to everything in RAM.

> CH>  I looked at Magma, and couldn't figure out whether it would be useful or
> CH>  not.  I've no idea just how fast it is, how capacious it is, or how much
>
> Chris Muller is on Squeak dev and I'm sure he will be able to tell you
> if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses
> Magma in a commercial project (last time I looked).
>
> CH>  ahead of time.  And I want locally separate files, so I guess I'd
> CH>  probably use sqlite or Firebird.  With Sqlite I might need to have
> CH>  multiple databases to handle the final system, so it would probably be
> CH>  best to partition things early.  (Either that or build some sort of
> CH>  hierarchical storage system that rolled things from database to database
> CH>  depending of how recently it was accessed.)
>
> SqueakDbx or (openDbx in other languages) might be of interest. I use
> mysql from Squeak in a commercial setting, no problems.
>    
That is of interest, but MySQL is in the same boat as PostgreSQL in
having a system-level database rather than separate database files.  This
makes many of the uses that I intend problematic...and difficult at
best.  Both Firebird and Sqlite, however, allow specified db files.
Sqlite is more common, so that's probably what I'll choose, even though
Firebird has a reputation for being more efficient.  (However, I think
both are supported by openDbx, so probably also by SqueakDbx.)
> CH>  I'm guessing that FileStream would handle file BOM markers gracefully.
> CH>  (Most of my files are utf8 with BOM markers at the head.)  This isn't
>
> Just try it to be sure..
>    
Yeah, that will be a part of the first test.

> CH>  (I wouldn't need any fancy mapper.  If I weren't dealing with LOTS of
> CH>  variable length arrays of variable length strings, I could just fit the
> CH>  data into a simple C struct without any pointers whatsoever.  So all I
> CH>  need is to be able to save a list of lists of chars, plus a few integers
> CH>  that would all fit comfortably into 32 bits.  [Many of them would fit
> CH>  into 8 bits.])
>
> CouchDB has caught my attention for inhomogeneous data, scalability,
> replication. But then I consider javascript a nice functional language
> and I like JSON (available in Squeak). At least look at map reduce
> algorithm for being able to utilize multi-core or multiple boxes.
> Whatever language you choose.
>    
Multiple boxes isn't particularly interesting, but I'm expecting the
number of cores/box to ramp up quickly over the next decade...and that
*is* interesting.
> CH>  later, and D doesn't have much in the way of concurrency handling.  I'm
> CH>  not sure that Hydra counts...though it sounds like I need to look into
> CH>  it.  The question would be how to programs running on separate virtual
> CH>  machines communicate with each other.
>
> Two different issues, Hydra addresses one single machine and does not
> support current Squeak (recent discussion on Squeak dev). The other
> issue is communicating via network. This is where you'll end up.
>    
I don't expect to end up "communicating via network", except, perhaps,
via localhost.  But I do expect to end up running several processes,
probably on different cores.  This causes many, but not all, of the same
problems.  (Current support is less important, as this is something a
bit off in the future.  But it needs to be planned for now, before I
start writing the code.)  Guess I'll see if I can find that "Squeak dev"
discussion.  Perhaps Dbus is the correct answer...I've only skimmed over
its specs, but it looks plausible.  (Getting info back from separate
processes seems a major problem with most of the approaches.  It may
well turn out that TCP over UnixSockets is the best approach
available..though I *would* like something better.)
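
For what it's worth, here is a hedged Python sketch of the socket approach. Strictly speaking a Unix domain socket is a local stream socket rather than TCP, but it is used the same way. The sketch uses `socket.socketpair()` and length-prefixed messages so the receiver knows where each reply ends. A thread stands in for the second process to keep the example short, and the framing and function names are assumptions for illustration.

```python
import socket
import struct
import threading


def send_msg(sock, payload):
    """Send one message: 4-byte little-endian length, then the bytes."""
    sock.sendall(struct.pack("<I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes, looping over short reads."""
    chunks = []
    while size:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one length-prefixed message."""
    (size,) = struct.unpack("<I", recv_exact(sock, 4))
    return recv_exact(sock, size)


def worker(sock):
    """Stand-in for the other process: answers with the request upper-cased."""
    with sock:
        send_msg(sock, recv_msg(sock).upper())


def ask(payload):
    """Send one request to a worker and wait for its reply."""
    ours, theirs = socket.socketpair()
    thread = threading.Thread(target=worker, args=(theirs,))
    thread.start()
    with ours:
        send_msg(ours, payload)
        reply = recv_msg(ours)
    thread.join()
    return reply
```

The same `send_msg` and `recv_msg` pair would work unchanged over a socket connected between two real processes, which is where the "getting info back" part of the problem is solved: the reply is just another framed message.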

Re: Pre-Getting started info: Unicode, utf8, large memory need

Paul DeBruicker
In reply to this post by Charles Hixson-2
I think you might benefit from looking at GemStone, especially the
free version.  You haven't mentioned the total size of your
planned DB, but up to 4GB is free. After that you pay, but it's
sufficient to prove what you're doing. They seem to have the features
you're looking for.


See:
http://seaside.gemstone.com/

for their free version.  

They have a mailing list here:
http://seaside.gemstone.com/mailman/listinfo/beta




> Message: 7
> Date: Thu, 29 Apr 2010 11:26:41 -0700
> From: Charles Hixson <[hidden email]>
> Subject: Re: [Newbies]  Pre-Getting started info: Unicode, utf8, large
> memory need

Re: Pre-Getting started info: Unicode, utf8, large memory need

hernanmd
In reply to this post by Charles Hixson-2
Hi Charles,

You might like to look at Opus, the FRBR-oo implementation in Squeak:

http://www.frbr.org/2008/09/25/manzanos
http://www.caicyt.gov.ar/letodoc

Cheers,

Hernán

2010/4/28 Charles Hixson <[hidden email]>:

> On 04/27/2010 09:10 PM, Herbert König wrote:
>
> Long term goal (1-4 decades):  A librarian program that can dig the answers
> to "reasonable" questions out of the books that it handles.  And can also
> recommend books in answer to slightly less reasonable questions.