Hi,
I need full text indexing and searching over 3-5 million documents. Currently the products I work on are written in PHP - MySQL. What would you use in the Smalltalk world to do this? (Open source is mandatory, since our products are open source.)
Laurent Laffont - @lolgzs Pharo Smalltalk Screencasts: http://www.pharocasts.com/ Blog: http://magaloma.blogspot.com/ |
There isn't a full solution that I know of, but maybe you can use Lucene to store the index information outside Smalltalk and use the references it returns to look up the real Smalltalk objects.

There are also bindings to Lucene and Solr for several NoSQL databases, such as MongoDB and Cassandra, and there are Smalltalk bindings for MongoDB and Cassandra.

Cheers

-- Miguel Cobá http://twitter.com/MiguelCobaMtz http://miguel.leugim.com.mx |
In reply to this post by laurent laffont
I'm interested in your findings.
Did you also ask in Squeak? Stef |
In this case, a better solution could be a NoSQL database such as CouchDB, or another document-oriented DB. Best.

On Thu, Jan 27, 2011 at 5:34 PM, Stéphane Ducasse <[hidden email]> wrote:
> I'm interested in your findings.
|
In reply to this post by laurent laffont
On Thu, 27 Jan 2011, laurent laffont wrote:
> What would you use (open source - mandatory - our products are open source)
> in Smalltalk world to do this ?

PostgreSQL of course. It's not Smalltalk, but there are at least 3 different bindings for it and it has built-in full text indexing: http://www.postgresql.org/docs/current/static/textsearch.html

Levente |
In reply to this post by Miguel Cobá
As for bindings for MongoDB, if you're referring to http://www.squeaksource.com/MongoTalk: though it's an excellent start, it is not quite production ready. So far only the MongoDB equivalents of 'select' and 'insert' have been implemented, not update or delete. I'm working on implementing those operations right now and submitting a patch, and should have a post up about it in a couple of weeks. But Mongo is not yet in the running for full text search.

Dmitri Zagidulin
http://smalltalkzen.wordpress.com

On Thu, Jan 27, 2011 at 2:58 PM, Miguel Cobá <[hidden email]> wrote:
> There are also bindings for several nosql databases for lucene and solr
> like mongo db and cassandra and there are bindings in smalltalk for
> mongodb and cassandra.
|
On Thu, 27-01-2011 at 18:24 -0500, Dmitri Zagidulin wrote:
> As far as bindings for MongoDB, if you're referring to
> http://www.squeaksource.com/MongoTalk - though it's an excellent
> start, it is not quite production ready. In that -
> only the MongoDB equivalents of 'select' and 'insert' have been
> implemented, but not update or delete. I'm working on implementing
> those operations right now & submitting a patch, and should have a
> post up about it in a couple of weeks.
> But Mongo is not yet in the running, for full text search.

Oh, good to know.

-- Miguel Cobá http://twitter.com/MiguelCobaMtz http://miguel.leugim.com.mx |
Hi!
I think the absolutely most interesting NoSQL DB with this in mind is Riak and its Riak Search. Google it. I have been itching to write a binding for Riak but have had no time to do it.

Riak is built for "crazy just add another node" scaling and is modelled after Dynamo. Recently Riak Search was added, which is a Lucene-compatible "overlay" for Riak, thus giving you a free text search DB with insane scalability and robustness.

CouchDB does not have anything close to this scalability, unless you look at Cloudant perhaps, but AFAIK it does not have any kind of free text indexing. Yes, CouchDB has a Lucene integration (I have written a binding in C# for CouchDB and the Lucene integration), but we are still talking about a "single server" setup.

MongoDB is fast as hell, but AFAIK it also has neither a free text engine nor the same "add another box" scalability that Riak has.

regards, Göran |
Riak looks extremely erotic: facet queries, HTTP/JSON query interface... Thanks for the pointer.
Laurent.

2011/1/28 Göran Krampe <[hidden email]>
> Hi!
|
In reply to this post by laurent laffont
On 27.01.2011, at 20:48, laurent laffont wrote:
> I need to have full text indexing and searching over a 3-5 million documents. Actually the products I work on are written in PHP - MySQL.
>
> What would you use (open source - mandatory - our products are open source) in Smalltalk world to do this ?

That topic has troubled me for some time over the last few years. The first approach I took was PostgreSQL + tsearch2, because I used Glorp back then. AFAIK tsearch2 has been incorporated into PostgreSQL in the meantime, so it is no extra weight. The stemming libraries for PostgreSQL are quite good and available for some languages. If you use Glorp, you can even extend the query language to support it out of the box.

I changed the setup when I moved from PostgreSQL to GemStone. With the new setup I would have used PostgreSQL only for the full-text retrieval, which I didn't like, so I used SOLR [1] instead. It is a search server built upon Lucene which supports faceted search and a lot of nifty stuff. I used Magritte to generate the XML that I put into SOLR. I was satisfied with this solution, but all solutions that need to cross the Smalltalk boundary become cumbersome at some stage.

If I find time I would take another approach. I found a Smalltalk stemming library while following the Moose mailing list. With good stemming support it may be a reasonable effort to produce forward and inverted indexes of your objects with search capabilities. If you use Magritte (or pragmas), you could mark the fields to index there. The rest would be automatic: take all fields that should be searchable, tokenize and canonicalize (using stemming) the words, and build the forward and inverted indexes. The only other thing needed is a change notification system that updates the index, but I think that is a fairly easy task. Well, that is the sketch of what I would try to do myself, but I can't find the time right now.
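
The forward/inverted index idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than Smalltalk, not an existing library: the `TextIndex` class, its method names, and the crude suffix-stripping `stem` function are all hypothetical stand-ins (a real setup would use a proper stemming library), and `update` plays the role of the change notification hook.

```python
import re
from collections import defaultdict

# Crude stand-in for a real stemmer: strips a few English suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Return a canonical form of `word` by stripping one known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphanumeric words, and stem each word."""
    return [stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]


class TextIndex:
    """Hypothetical in-memory index over objects identified by an id."""

    def __init__(self):
        self.forward = {}                 # object id -> set of terms
        self.inverted = defaultdict(set)  # term -> set of object ids

    def update(self, obj_id, text):
        """Index (or re-index) an object; call this when the object changes."""
        self.remove(obj_id)
        terms = set(tokenize(text))
        self.forward[obj_id] = terms
        for term in terms:
            self.inverted[term].add(obj_id)

    def remove(self, obj_id):
        """Drop an object, using the forward index to find its terms."""
        for term in self.forward.pop(obj_id, ()):
            postings = self.inverted[term]
            postings.discard(obj_id)
            if not postings:
                del self.inverted[term]

    def search(self, query):
        """Return the ids of objects containing every term of the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            postings = self.inverted.get(term, set())
            result = set(postings) if result is None else result & postings
        return result
```

Keeping the forward index alongside the inverted one is what makes updates cheap: when an object changes, its old terms are already known and can be removed without rescanning the whole index.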
I personally think that the power you gain by searching Smalltalk objects directly outweighs all the external servers to some extent. If you use a special external database anyway that supports it, well, then that would be the easiest start.

hope this helps,
Norbert |
Laurent: several relational databases provide full text search capabilities. You can send a SELECT statement with a MATCH() function to the server and get the search results back (in the case of MySQL; tsearch in PostgreSQL, I think). I think the OpenDBX library has to do nothing special here: it depends on which query you write.

cheers
mariano

On Fri, Jan 28, 2011 at 3:57 AM, Norbert Hartl <[hidden email]> wrote:
|
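For reference, the kind of SELECT with MATCH() described above might look like the sketch below. It is written in Python only to keep it self-contained; the `documents` table, its `title` and `body` columns, and the `build_fulltext_query` helper are hypothetical, a FULLTEXT index on (title, body) is assumed, and the `%s` placeholders assume a database driver that uses that parameter style. The MATCH ... AGAINST ... IN NATURAL LANGUAGE MODE syntax itself is MySQL's.

```python
def build_fulltext_query(terms, limit=20):
    """Return (sql, params) for a MySQL natural-language full text search.

    Assumes a hypothetical `documents` table with a FULLTEXT index
    on (title, body). The search terms are passed as parameters and
    never interpolated into the SQL string.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    sql = (
        "SELECT id, title, "
        "MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM documents "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    return sql, (terms, terms, int(limit))


sql, params = build_fulltext_query("smalltalk indexing", limit=5)
print(sql)
print(params)
```

The client library only has to pass the statement through, which matches the point that nothing special is needed on the binding side: the full text logic lives entirely in the query.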
Indeed, we currently use the MySQL full text search engine successfully, and also Lucene (but not with Pharo ;) I think I should try OpenDBX.

Laurent

On Fri, Jan 28, 2011 at 3:44 PM, Mariano Martinez Peck <[hidden email]> wrote:
> Laurent: several relational databases provide full text search capabilities.
|
On Fri, Jan 28, 2011 at 9:54 AM, laurent laffont <[hidden email]> wrote:
You can also try the native MySQL client for Squeak. Not sure about its current state, but you don't need any library. http://www.squeaksource.com/MySQL.html For SqueakDBX, check everything in www.squeakdbx.org Laurent |
I would recommend using Lucene to index the documents, for various reasons, not limited to the fact that if one is interested in indexing documents, and millions of them, it obviates talking about databases.

With Lucene, I would recommend using the Groovy bridge with XML-RPC from Pharo to pass back and forth whatever you need: the path to the documents, and the file entry properties of the documents found through the indexed search.

Coincidentally, I have been intending to complete the half-done work on this for the last year, for our internal div!

-Skrish

On Fri, Jan 28, 2011 at 11:53 PM, Mariano Martinez Peck <[hidden email]> wrote:
|