Squeak looks interesting, but before getting started I need three pieces of info:

1) How does one read and write utf8 files?
2) Can strings be indexed by chars, even if they are unicode rather than ascii?
3) What happens if you have more data than will fit into RAM?

(For 3, "use a database" is an acceptable answer, but I'm hoping for something involving automatic paging. For 2, "use 4 bytes/char" is acceptable, but only if there's a good answer to 3.)

I thought I could just look this up in the documentation, but it doesn't seem to address these points... at least not until you can read the code in the browser fluently. These questions also don't seem to have been addressed on the mailing list before. (At least a search didn't find them.)

An additional, but much less urgent, question is "How does one use Squeak on multiple cores of a multi-core processor?"

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
Hi Charles,
CH> 1) How does one read&write utf8 files?

The FileStream class has used UTF8 files by default for some time. So if you want non-UTF8 files (e.g. for speed reasons) you have to take extra measures.

CH> 2) Can strings be indexed by chars, even if they are unicode rather
CH> than ascii?

Yes. See class WideString.

CH> 3) What happens if you have more data than will fit into RAM?

Use a database :-))

CH> (For 3 "use a database" is an acceptable answer, but I'm hoping for
CH> something involving automatic paging.)

There are object databases like Magma which make this less painful. And OR mappers. Commercial products such as GemStone handle images bigger than RAM; I thought they had a free version, but I can't find it on their website.

CH> An additional, but much less urgent, question is "How does one use
CH> Squeak on multiple cores of a multi-core processor?"

There is an experimental "Hydra VM" which can run multiple images, each in its own native thread. Squeak itself runs in a single OS thread and uses green threads inside.

You might tell us what you want to achieve. Personally I'd say start small :-)

--
Cheers,
Herbert
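For comparison, questions 1 and 2 can be illustrated in Python, one of the other languages mentioned in this thread. This is only a minimal sketch and not Squeak code: the helper names `write_utf8` and `read_utf8`, the temporary file, and the sample string are made up for illustration. It shows that once a UTF-8 file has been decoded, indexing works per character, however many bytes each character took on disk.

```python
import os
import tempfile


def write_utf8(path, text):
    """Write text to path, encoded as UTF-8 (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_utf8(path):
    """Read a UTF-8 file back into a string of characters (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Sample text with two non-ASCII characters: U+00EF and U+03A9.
sample = "na\u00efve \u03a9mega"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_utf8(path, sample)
text = read_utf8(path)

# Indexing and length are counted in characters, not in encoded bytes.
seventh_char = text[6]
char_count = len(text)
byte_count = len(text.encode("utf-8"))
```

Here `char_count` and `byte_count` differ because the two non-ASCII characters each need two bytes in UTF-8, which is the distinction question 2 is asking about.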
On 04/27/2010 09:10 PM, Herbert König wrote:
> You might tell us, what you want to achieve. Personally I'd say start
> small :-)

Well, I am starting small, but the database isn't all that small. I'm planning, as a first step, to build a bibliographic database of "interesting books" from Project Gutenberg. Their files often leave out things like "When was this first published?" (sometimes it isn't known), which I want to include in my bibliography. I also want to include things like a story index and an author index for publications (e.g. magazines) that have multiple stories with multiple authors. Some of this I've already done by hand, but unfortunately I used two different formats, and the info also needs to be relocated to the end of the file. (I'm planning a table just prior to the "</body>" tag.)

The next step is to generate catalogs from this bibliographic information.
Then I want to package them together with the files onto something that will fit onto a DVD by the middle of November. (That should be practical.) The next step is to build indexes of names and where they appear. Etc. (I don't have the details planned out. Automated information retrieval is the goal, but not just free-form retrieval, and I don't know exactly what I'll need to do. It's likely to require pre-computing a lot of partial answers, though.)

I looked at Magma, and couldn't figure out whether it would be useful or not. I've no idea just how fast it is, how capacious it is, or how much RAM it consumes, and I don't even know what I should measure. It's the kind of thing that could look like it was working fine until one suddenly passed some critical usage level, and then it would just barely work at all, and I can't guess how one could determine that usage level ahead of time. And I want locally separate files, so I guess I'd probably use SQLite or Firebird. With SQLite I might need multiple databases to handle the final system, so it would probably be best to partition things early. (Either that or build some sort of hierarchical storage system that rolls things from database to database depending on how recently they were accessed.)

I'm guessing that FileStream would handle file BOM markers gracefully. (Most of my files are utf8 with BOM markers at the head.) This isn't totally standard, as many utf8 files don't have any markers to show that they aren't ascii (or extended ascii), but it's ONE of the standard approaches.

(I wouldn't need any fancy mapper. If I weren't dealing with LOTS of variable length arrays of variable length strings, I could just fit the data into a simple C struct without any pointers whatsoever. So all I need is to be able to save a list of lists of chars, plus a few integers that would all fit comfortably into 32 bits. [Many of them would fit into 8 bits.])

So far I'm still choosing the language.
I've got one routine implemented in D, Python, Ruby, and Java so far. Those could all be made to work. I'm currently working on a Vala implementation, and I'm considering a Smalltalk one. If D had the libraries for later use, it would be the clear winner so far. Unfortunately, I'm also considering later, and D doesn't have much in the way of concurrency handling. I'm not sure that Hydra counts... though it sounds like I need to look into it. The question would be how programs running on separate virtual machines communicate with each other. (N.B.: Ruby and Python also have this problem. Vala appears to have solved it.)

I also considered "go", but it appears to be too beta at the moment. The design of the language poses unique requirements on the documentation that they don't seem to be addressing. (It could be because the language is still in an early stage of development.)

Long term goal (1-4 decades): A librarian program that can dig the answers to "reasonable" questions out of the books that it handles. And can also recommend books in answer to slightly less reasonable questions.
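On the BOM question raised in the message above, the Python version of the routine can be made indifferent to whether a file starts with a BOM. The following is a small sketch under that assumption, not a statement about how Squeak's FileStream behaves; the function name `read_text_maybe_bom` and the two temporary files are hypothetical.

```python
import codecs
import os
import tempfile


def read_text_maybe_bom(path):
    """Read a UTF-8 file, dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec decodes ordinary UTF-8 and also strips an
    # initial EF BB BF byte sequence when it is there.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


tmp = tempfile.mkdtemp()
with_bom = os.path.join(tmp, "with_bom.txt")
without_bom = os.path.join(tmp, "without_bom.txt")

# Write the same content twice, once with and once without a BOM.
body = "Title: Example\n".encode("utf-8")
with open(with_bom, "wb") as f:
    f.write(codecs.BOM_UTF8 + body)
with open(without_bom, "wb") as f:
    f.write(body)

a = read_text_maybe_bom(with_bom)
b = read_text_maybe_bom(without_bom)
```

Both reads give the same string, so a mixed collection of files with and without the marker can go through one code path.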
Hi Charles,
seems you are on top of things. So just a few remarks. My experience is from Squeak 3.8, so you should check if what I say holds true for current Squeak.

Check out the UTF8 speed. I combine tab delimited files from disparate sources into more complex objects and write out new files. First thing was to change to non UTF8 for speed reasons. Seems you can't do this.

CH> I looked at Magma, and couldn't figure out whether it would be useful or
CH> not. I've no idea just how fast it is, how capacious it is, or how much

Chris Muller is on Squeak dev and I'm sure he will be able to tell you if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses Magma in a commercial project (last time I looked).

CH> ahead of time. And I want locally separate files, so I guess I'd
CH> probably use sqlite or Firebird. With Sqlite I might need to have
CH> multiple databases to handle the final system, so it would probably be
CH> best to partition things early. (Either that or build some sort of
CH> hierarchical storage system that rolled things from database to database
CH> depending of how recently it was accessed.)

SqueakDbx (or openDbx in other languages) might be of interest. I use mysql from Squeak in a commercial setting, no problems.

CH> I'm guessing that FileStream would handle file BOM markers gracefully.
CH> (Most of my files are utf8 with BOM markers at the head.) This isn't

Just try it to be sure.

CH> (I wouldn't need any fancy mapper. If I weren't dealing with LOTS of
CH> variable length arrays of variable length strings, I could just fit the
CH> data into a simple C struct without any pointers whatsoever. So all I
CH> need is to be able to save a list of lists of chars, plus a few integers
CH> that would all fit comfortably into 32 bits. [Many of them would fit
CH> into 8 bits.])

CouchDB has caught my attention for inhomogeneous data, scalability, replication. But then I consider javascript a nice functional language and I like JSON (available in Squeak).
At least look at the map-reduce algorithm as a way to utilize multiple cores or multiple boxes, whatever language you choose.

CH> later, and D doesn't have much in the way of concurrency handling. I'm
CH> not sure that Hydra counts...though it sounds like I need to look into
CH> it. The question would be how to programs running on separate virtual
CH> machines communicate with each other.

Two different issues: Hydra addresses one single machine and does not support current Squeak (recent discussion on Squeak dev). The other issue is communicating via network. This is where you'll end up.

--
Cheers,
Herbert
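To make the map-reduce suggestion concrete, here is a rough Python sketch of building a "names and where they appear" index in two steps. Everything in it is illustrative: the `NAMES` list, the document ids, and the function names are invented, and the naive substring match stands in for whatever real name detection would be needed. It runs sequentially with the built-in `map`; the assumption is that a parallel map with the same calling convention (for example `multiprocessing.Pool.map`) could be passed in as `map_fn` to spread the map step over several cores.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical list of names to look for.
NAMES = ("Verne", "Wells")


def map_names(doc):
    """Map step: emit (name, doc_id) pairs for one document."""
    doc_id, text = doc
    return [(name, doc_id) for name in NAMES if name in text]


def reduce_pairs(index, pairs):
    """Reduce step: merge one document's pairs into the shared index."""
    for name, doc_id in pairs:
        index[name].add(doc_id)
    return index


def build_index(docs, map_fn=map):
    """Run the map step over all documents, then fold the results together."""
    mapped = map_fn(map_names, docs)
    return reduce(reduce_pairs, mapped, defaultdict(set))


docs = [
    ("book1.html", "Jules Verne wrote this."),
    ("book2.html", "H. G. Wells and Jules Verne."),
    ("book3.html", "Anonymous."),
]
index = build_index(docs)
```

The map step touches one document at a time and shares no state, which is what makes it a candidate for running on separate cores or machines; only the reduce step needs to see all the results.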
In reply to this post by Charles Hixson-2
Hi Charles,
>> CH> 1) How does one read&write utf8 files?

We got OT here (my fault), so if there's interest, let's split this into a private discussion and an on-topic list discussion.

--
Cheers,
Herbert
In reply to this post by Herbert König
On 04/28/2010 09:31 PM, Herbert König wrote:
> Check out the UTF8 speed. I combine tab delimited files from disparate
> sources into more complex objects and write out new files. First thing
> was to change to non UTF8 for speed reasons. Seems you can't do this.

I'm not worried about speed for this first part, and for the follow-up I'm more worried about computational speed than utf8 reading speed. If I can't depend on virtual memory and automatic roll-in/out (nobody seems to offer that!) then it means LOTS of database interaction. Which is where I get worried about Magma... as apparently it holds a partial reference to everything in RAM.

> Chris Muller is on Squeak dev and I'm sure he will be able to tell you
> if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses
> Magma in a commercial project (last time I looked).
>
> SqueakDbx or (openDbx in other languages) might be of interest. I use
> mysql from Squeak in a commercial setting, no problems.

That is of interest, but MySql is in the same boat as PostGreSQL with having a system level database rather than separate database files. This makes many of the uses that I intend problematic... and difficult at best. Both Firebird and Sqlite, however, allow specified db files. Sqlite is more common, so that's probably what I'll choose, even though Firebird has a reputation for being more efficient.
(However I think both are supported by openDbx, so probably also by SqueakDbx.)

> CH> I'm guessing that FileStream would handle file BOM markers gracefully.
> CH> (Most of my files are utf8 with BOM markers at the head.)
>
> Just try it to be sure..

Yeah, that will be a part of the first test.

> CouchDB has caught my attention for inhomogeneous data, scalability,
> replication. But then I consider javascript a nice functional language
> and I like JSON (available in Squeak). At least look at map reduce
> algorithm for being able to utilize multi-core or multiple boxes.
> Whatever language you choose.

Multiple boxes isn't particularly interesting, but I'm expecting the number of cores/box to ramp up quickly over the next decade... and that *is* interesting.

> Two different issues, Hydra addresses one single machine and does not
> support current Squeak (recent discussion on Squeak dev). The other
> issue is communicating via network. This is where you'll end up.

I don't expect to end up "communicating via network", except, perhaps, via localhost. But I do expect to end up running several processes, probably on different cores. This causes many, but not all, of the same problems. (Current support is less important, as this is something a bit off in the future. But it needs to be planned for now, before I start writing the code.)
Guess I'll see if I can find that "Squeak dev" discussion. Perhaps Dbus is the correct answer... I've only skimmed over its specs, but it looks plausible. (Getting info back from separate processes seems a major problem with most of the approaches. It may well turn out that TCP over UnixSockets is the best approach available... though I *would* like something better.)
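The SQLite option discussed in this message (a database kept in a file at a path you choose, holding variable-length lists of strings plus a few small integers) could look roughly like the Python sketch below. The table layout, the column names, the file name `catalog_a.sqlite`, and the sample record are all assumptions made up for illustration; they are not a schema proposed by anyone in the thread.

```python
import os
import sqlite3
import tempfile


def open_partition(path):
    """Open (or create) one SQLite database file at an explicit path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS work ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " first_published INTEGER)"  # NULL when the year is not known
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS work_author ("
        " work_id INTEGER NOT NULL REFERENCES work(id),"
        " position INTEGER NOT NULL,"
        " name TEXT NOT NULL)"
    )
    return conn


def add_work(conn, title, first_published, authors):
    """Store one work and its variable-length, ordered list of authors."""
    cur = conn.execute(
        "INSERT INTO work (title, first_published) VALUES (?, ?)",
        (title, first_published),
    )
    work_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO work_author (work_id, position, name) VALUES (?, ?, ?)",
        [(work_id, i, name) for i, name in enumerate(authors)],
    )
    conn.commit()
    return work_id


def authors_of(conn, work_id):
    """Return the authors of a work in their stored order."""
    rows = conn.execute(
        "SELECT name FROM work_author WHERE work_id = ? ORDER BY position",
        (work_id,),
    )
    return [row[0] for row in rows]


db_path = os.path.join(tempfile.mkdtemp(), "catalog_a.sqlite")
conn = open_partition(db_path)
work_id = add_work(conn, "Example Magazine, Vol. 1", None,
                   ["First Author", "Second Author"])
names = authors_of(conn, work_id)
conn.close()
```

Because each call to `open_partition` takes its own path, splitting the data across several database files early, as suggested earlier in the thread, would only mean choosing which file a given record goes into.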
In reply to this post by Charles Hixson-2
I think you might benefit from looking at Gemstone, especially the free version. You haven't mentioned the total size of your planned DB, but up to 4GB is free. After that you pay, but it's sufficient to prove what you're doing. They seem to have the features you're looking for.

See http://seaside.gemstone.com/ for their free version. They have a mailing list here: http://seaside.gemstone.com/mailman/listinfo/beta

> Message: 7
> Date: Thu, 29 Apr 2010 11:26:41 -0700
> From: Charles Hixson <[hidden email]>
> Subject: Re: [Newbies] Pre-Getting started info: Unicode, utf8, large
>     memory need
>
> I'm not worried about speed for this first part, and for the
> follow-up I'm more worried about computational speed than utf8
> reading speed. If I can't depend on virtual memory and automatic
> roll-in/out (nobody seems to offer that!) then it means LOTS of
> database interaction. Which is where I get worried about Magma...as
> apparently it holds a partial reference to everything in RAM.
>
> [...]
>
> End of Beginners Digest, Vol 48, Issue 34
In reply to this post by Charles Hixson-2
Hi Charles,
You might like to see Opus, the FRBR-oo implementation in Squeak:

http://www.frbr.org/2008/09/25/manzanos
http://www.caicyt.gov.ar/letodoc

Cheers,
Hernán

2010/4/28 Charles Hixson <[hidden email]>:
> Long term goal (1-4 decades): A librarian program that can dig the answers
> to "reasonable" questions out of the books that it handles. And can also
> recommend books in answer to slightly less reasonable questions.