Databases with full text search support


Databases with full text search support

laurent laffont
Hi,

I need full text indexing and searching over 3-5 million documents. Currently the products I work on are written in PHP and MySQL.

What would you use in the Smalltalk world to do this? Open source is mandatory, since our products are open source.

Laurent Laffont - @lolgzs

Pharo Smalltalk Screencasts: http://www.pharocasts.com/
Blog: http://magaloma.blogspot.com/

Re: Databases with full text search support

Miguel Cobá
There isn't a full solution that I know of, but maybe you can use Lucene
to store the index info outside Smalltalk and use the references it returns
to look up the real Smalltalk objects.

There are also Lucene and Solr bindings for several NoSQL databases,
such as MongoDB and Cassandra, and there are Smalltalk bindings for
MongoDB and Cassandra.
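
To make the "index outside, objects inside" idea concrete, here is a minimal sketch in Python (not Smalltalk, and not Lucene's API). `ExternalIndex` is a hypothetical toy stand-in for the external engine: it stores only terms and ids, and the application resolves the returned ids against its own object store. All names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for a 'real' application object living in the image."""
    doc_id: int
    title: str
    body: str


class ExternalIndex:
    """Toy stand-in for an external engine such as Lucene (hypothetical).

    It stores only terms and document ids, never the objects themselves.
    """

    def __init__(self):
        self._postings = {}  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self._postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


# Application-side object store, keyed by the same ids the index returns.
store = {
    1: Document(1, "Pharo", "a clean Smalltalk environment"),
    2: Document(2, "Lucene", "a full text search library"),
}

index = ExternalIndex()
for doc in store.values():
    index.add(doc.doc_id, doc.title + " " + doc.body)

# The index answers with references (ids); the application resolves them.
hits = [store[doc_id] for doc_id in index.search("smalltalk")]
print([d.title for d in hits])
```

The point of the split is that the index never needs to understand the objects; it only has to hand back stable ids.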

Cheers

--
Miguel Cobá
http://twitter.com/MiguelCobaMtz
http://miguel.leugim.com.mx





Re: Databases with full text search support

Stéphane Ducasse
In reply to this post by laurent laffont
I'm interested in your findings.
Did you also ask in squeak?

Stef




Re: Databases with full text search support

Diogenes Moreira
In this case, a better solution could be a NoSQL database such as CouchDB, or another document-oriented DB.

If I remember correctly, a Seaside CouchDB integration exists.

Best.





Re: Databases with full text search support

Levente Uzonyi-2
In reply to this post by laurent laffont
On Thu, 27 Jan 2011, laurent laffont wrote:

> Hi,
>
> I need full text indexing and searching over 3-5 million
> documents. Currently the products I work on are written in PHP and MySQL.
>
> What would you use in the Smalltalk world to do this? Open source is
> mandatory, since our products are open source.

PostgreSQL of course. It's not Smalltalk, but there are at least three
different bindings for it and it has built-in full text indexing:
http://www.postgresql.org/docs/current/static/textsearch.html .
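
As a rough illustration of what such a query looks like, the Python sketch below only builds the SQL text; it does not connect to a server. `to_tsvector`, `plainto_tsquery`, the `@@` match operator and GIN indexes are PostgreSQL features, while the table name `documents`, the column `body` and the index name are hypothetical. The `%s` placeholder assumes a driver that uses that parameter style.

```python
def build_pg_search(table, column, config="english"):
    """Return a parameterized PostgreSQL full text query.

    table, column and config are hypothetical identifiers chosen by the
    developer (never by end users); only the search terms are left as a
    bound parameter (%s) for the database driver to fill in.
    """
    return (
        f"SELECT id FROM {table} "
        f"WHERE to_tsvector('{config}', {column}) "
        f"@@ plainto_tsquery('{config}', %s)"
    )


def build_pg_index(table, column, config="english"):
    """Return DDL for a GIN expression index matching the query above."""
    return (
        f"CREATE INDEX {table}_{column}_fts ON {table} "
        f"USING gin (to_tsvector('{config}', {column}))"
    )


sql = build_pg_search("documents", "body")
index_ddl = build_pg_index("documents", "body")
print(index_ddl)
print(sql)
```

Any of the Smalltalk bindings would send the same SQL; the binding does not need to know that the query is a full text search.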


Levente



Re: Databases with full text search support

Dmitri Zagidulin
In reply to this post by Miguel Cobá
As for bindings for MongoDB, if you're referring to
http://www.squeaksource.com/MongoTalk: though it's an excellent
start, it is not quite production ready, in that only the MongoDB
equivalents of 'select' and 'insert' have been implemented, but not
update or delete. I'm working on implementing those operations right
now and will submit a patch, and I should have a post up about it in a
couple of weeks. In any case, Mongo is not yet in the running for full
text search.

Dmitri Zagidulin
http://smalltalkzen.wordpress.com



Re: Databases with full text search support

Miguel Cobá
On Thu, 27-01-2011 at 18:24 -0500, Dmitri Zagidulin wrote:
> As for bindings for MongoDB, if you're referring to
> http://www.squeaksource.com/MongoTalk: though it's an excellent
> start, it is not quite production ready, in that only the MongoDB
> equivalents of 'select' and 'insert' have been implemented, but not
> update or delete. I'm working on implementing those operations right
> now and will submit a patch, and I should have a post up about it in a
> couple of weeks. In any case, Mongo is not yet in the running for full
> text search.

Oh, good to know.

--
Miguel Cobá
http://twitter.com/MiguelCobaMtz
http://miguel.leugim.com.mx





Re: Databases with full text search support

Göran Krampe
Hi!

I think the absolutely most interesting NoSQL DB with this in mind is
Riak and its Riak Search. Google it. I have been itching to write a
binding for Riak but have had no time to do it.

Riak is built for "crazy just add another node" scaling and is modelled
after Dynamo. Recently Riak Search was added, which is a Lucene
compatible "overlay" for Riak thus giving you a free text search db with
insane scalability and robustness.

CouchDB does not have anything close to this scalability, unless you
look at Cloudant perhaps - but AFAIK it does not have any kind of free
text indexing. Yes, CouchDB has a Lucene integration (I have written a
binding in C# for CouchDB and the Lucene integration) but we are still
talking "single server" setup.

MongoDB is fast as hell, but AFAIK it also has neither a free text engine
nor the "add another box" scalability that Riak has.

regards, Göran


Re: Databases with full text search support

laurent laffont
Riak looks extremely erotic: facet queries, HTTP/JSON query interface... Thanks for the pointer.

Laurent. 




Re: Databases with full text search support

NorbertHartl
In reply to this post by laurent laffont

On 27.01.2011, at 20:48, laurent laffont wrote:

> Hi,
>
> I need full text indexing and searching over 3-5 million documents. Currently the products I work on are written in PHP and MySQL.
>
> What would you use in the Smalltalk world to do this? Open source is mandatory, since our products are open source.
>
That topic has troubled me for some time over the last few years. The first approach I took was PostgreSQL + tsearch2, because I used Glorp back then. AFAIK tsearch2 has been incorporated into PostgreSQL in the meantime, so it is no extra weight. The stemming libraries for PostgreSQL are quite good and available for several languages. If you use Glorp you can even extend the query language to support it out of the box.
I changed the setup when I moved from PostgreSQL to GemStone. With the new setup I would have used PostgreSQL only for the full text retrieval, which I didn't like. I used Solr [1] instead. It is a search server built upon Lucene which supports faceted search and a lot of nifty stuff. I used Magritte to generate the XML that I put into Solr. I was satisfied with this solution. But all solutions that need to cross the Smalltalk boundary start to become cumbersome at some stage.
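
For readers who have not seen it, the XML that goes into Solr is an `<add>` message containing `<doc>` elements with named `<field>` children. The Python sketch below generates such a message from plain dictionaries; it is only an illustration, not the Magritte-based generator described above, and the field names `id` and `title` are hypothetical (they would have to exist in the Solr schema).

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize dicts into a Solr XML update message: <add><doc><field/>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_payload = to_solr_add_xml([{"id": 42, "title": "Full text search"}])
print(xml_payload)
```

The resulting string would then be posted to Solr's update handler over HTTP, which is exactly the boundary crossing mentioned above.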
If I find the time I would take another approach. I found a Smalltalk stemming library while following the Moose mailing list. If you have good stemming support, it may be a reasonable effort to produce forward and inverted indexes of your objects with search capabilities. If you use Magritte (or pragmas) you could declare the fields to index there. The rest would be automatic: take all fields that should be searchable, tokenize and canonicalize the words (using stemming), and build the forward and inverted indexes. The only other thing needed is a change notification system that updates the index, but I think that is a fairly easy task. Well, that is a sketch of what I would try to do myself, but I can't find the time right now. I personally think that the power you gain by searching Smalltalk objects directly outweighs all the external servers to some extent.
If you already use an external database that supports it, well, then that would be the easiest start.
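
The in-image approach described above (declared searchable fields, tokenizing, stemming, forward and inverted indexes, re-indexing on change) can be sketched roughly as follows. This is Python rather than Smalltalk, `naive_stem` is a deliberately crude placeholder for a real stemming library, and every class and method name here is hypothetical.

```python
import re
from collections import defaultdict


def naive_stem(word):
    """Very crude suffix stripper; a real stemming library would replace it."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase, split into alphanumeric words, and stem each word."""
    return [naive_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


class ObjectIndex:
    def __init__(self, searchable_fields):
        self.fields = searchable_fields   # which attributes to index
        self.forward = {}                 # object id -> set of terms
        self.inverted = defaultdict(set)  # term -> set of object ids

    def update(self, obj_id, obj):
        """Call on every change notification to (re)index one object."""
        self.remove(obj_id)
        terms = set()
        for field in self.fields:
            terms.update(tokens(str(getattr(obj, field, ""))))
        self.forward[obj_id] = terms
        for term in terms:
            self.inverted[term].add(obj_id)

    def remove(self, obj_id):
        """Use the forward index to find and drop the old postings."""
        for term in self.forward.pop(obj_id, set()):
            self.inverted[term].discard(obj_id)

    def search(self, query):
        """Return ids of objects containing every query term."""
        sets = [self.inverted.get(t, set()) for t in tokens(query)]
        return set.intersection(*sets) if sets else set()


class Note:
    def __init__(self, title, body):
        self.title = title
        self.body = body


idx = ObjectIndex(["title", "body"])
idx.update(1, Note("Indexing objects", "Searching Smalltalk images"))
idx.update(2, Note("Cooking", "Recipes for soups"))
print(idx.search("searches index"))
```

The forward index is what makes updates cheap: when an object changes, its old terms can be removed without scanning the whole inverted index.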

hope this helps,

Norbert




Re: Databases with full text search support

Mariano Martinez Peck
Laurent: several relational databases provide full text search capabilities.

You can send a SELECT statement with a MATCH() function to the server and get
the search results back (in the case of MySQL). I think the OpenDBX
library has to do nothing special here: it depends only on which query you write.

Both MySQL and PostgreSQL servers support full text search (called
tsearch in PostgreSQL, I think).
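
For illustration, a MySQL full text query of that kind might be assembled as in the Python sketch below. `MATCH (...) AGAINST (...)` is MySQL syntax and requires a FULLTEXT index on the listed columns; the table `documents`, the columns `title` and `body`, and the `%s` placeholder style are assumptions, and nothing here is specific to OpenDBX.

```python
def build_mysql_search(table, columns):
    """Return a parameterized MySQL full text query ordered by relevance.

    table and columns are hypothetical, developer-chosen identifiers; the
    search string is bound twice by the driver (once per %s placeholder).
    """
    column_list = ", ".join(columns)
    return (
        f"SELECT id, MATCH({column_list}) AGAINST (%s) AS score "
        f"FROM {table} "
        f"WHERE MATCH({column_list}) AGAINST (%s) "
        f"ORDER BY score DESC"
    )


mysql_sql = build_mysql_search("documents", ["title", "body"])
print(mysql_sql)
```

Since the search is expressed entirely in the SQL string, any database binding that can run a SELECT can use it.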


cheers

mariano






Re: Databases with full text search support

laurent laffont
Indeed, we currently use the MySQL full text search engine and also Lucene successfully (but not with Pharo ;)

I think I should try OpenDBX.

Laurent








Re: Databases with full text search support

Mariano Martinez Peck


On Fri, Jan 28, 2011 at 9:54 AM, laurent laffont <[hidden email]> wrote:
> Indeed, we currently use the MySQL full text search engine and also Lucene successfully (but not with Pharo ;)
>
> I think I should try OpenDBX.

You can also try the native MySQL client for Squeak. I'm not sure about its current state, but it doesn't need any external library.
http://www.squeaksource.com/MySQL.html

For SqueakDBX, check everything at www.squeakdbx.org










Re: Databases with full text search support

S Krish

I would recommend indexing the documents with Lucene, for various reasons, not least that if you are interested in indexing documents, and millions of them, you don't need to talk about databases at all.

With Lucene, I would recommend using a Groovy bridge over XML-RPC from Pharo, so that you can pass back and forth whatever you need: the paths to the documents and the file entry properties of the documents found by the indexed search.
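
To show what travels over such a bridge, the Python sketch below encodes and decodes XML-RPC messages with the standard library, without contacting any server. The method name `index.search`, its arguments and the shape of the result are all hypothetical; the actual Groovy bridge is not shown in this thread.

```python
import xmlrpc.client

# Request the client side (Pharo, in the setup described above) would send.
request_xml = xmlrpc.client.dumps(
    ("smalltalk AND database", 10),   # hypothetical arguments: query, max hits
    methodname="index.search",        # hypothetical remote method name
)
print(request_xml)

# Decoding shows what the bridge on the other side would receive.
params, method = xmlrpc.client.loads(request_xml)

# A response carrying document paths and file properties back to the client.
response_xml = xmlrpc.client.dumps(
    ([{"path": "/docs/a.txt", "size": 1024}],),
    methodresponse=True,
)
result, _ = xmlrpc.client.loads(response_xml)
print(result[0])
```

Only simple values (strings, numbers, lists, dictionaries) cross the wire, which is why the bridge passes document paths and properties rather than the documents themselves.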

Coincidentally, I have been meaning to complete my half-done work on this for the last year, for our internal division!

-Skrish

