Smalltalk › Gemtalk › GLASS

How to do optimized collection selects based on substring search?

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

9 messages Options

NorbertHartl

How to do optimized collection selects based on substring search?

I have a bigger collection of users that I want to search by name and email. And both can be substring searches. I read again in the programming guide but the indexed unordered collections seem only to bring benefit if it can be done based on identity or equality.
What is a proper way to scale with substring searches. Is there something in gemstone that might help?

Norbert

Dale Henrichs

Re: How to do optimized collection selects based on substring search?

Norbert,

There's nothing out of the box for substring matching ... I know that there are c libraries out there that do this sort of thing and have been wondering what a Smalltalk implementation based on GemStone would look like, but haven't gotten into the algorithms to see how gnarly they might be ...

Dale

----- Original Message -----
| From: "Norbert Hartl" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Monday, November 21, 2011 2:43:40 PM
| Subject: [GS/SS Beta] How to do optimized collection selects based on substring search?
|
| I have a bigger collection of users that I want to search by name and
| email. And both can be substring searches. I read again in the
| programming guide but the indexed unordered collections seem only to
| bring benefit if it can be done based on identity or equality.
| What is a proper way to scale with substring searches. Is there
| something in gemstone that might help?
|
| Norbert

NorbertHartl

Re: How to do optimized collection selects based on substring search?

Am 22.11.2011 um 01:07 schrieb Dale Henrichs:

Norbert,

There's nothing out of the box for substring matching ... I know that there are c libraries out there that do this sort of thing and have been wondering what a Smalltalk implementation based on GemStone would look like, but haven't gotten into the algorithms to see how gnarly they might be ...

Here we go. The world of search indexes is huge and papers are all over the place. It really depends on what one thinks is sufficient for substring search.

- A substring can be finding a word inside a text. This is done mostly with an inverted index (or inverted file). Meaning you split the text into words and you build an index the has each word associated with all things where the word occurs (objects in our case). To minimize ambiguity you use stemming to reduce words to their base construct in order to be able to match nouns, verbs etc. with the same term. Additionally applying stop-words to avoid finding to much things when a query contains and, or, the, etc.

- A substring that is part of a word can be match by using suffix arrays. Meaning from a word "smalltalk" you generate the suffixes: malltalk, alltalk, lltalk, ltalk, talk, alk, lk and k. This are kept in a lexically sorted index so you can use binary seach to find them. From this point on you can combine binary search with forward substring match on the result set which gives you a true substring match. These can be optimized by using n-grams. A 3-gram would give you for "smalltalk": s, sm, sma, mal, all, llt, lta, tal, alk, lk and k. This can be used to increase speed of lookups as well as finding similarities. That opens the fuzzy/approximate can of....

Most of the structures are easy to create and less easy to maintain, e.g when you have sorted indexes. But my problem with them in gemstone is that they great a big amount of extra data that can when used fill up the shared page cache really quick. Leaving less space for your business objects.

Now that gemstone has a working ffi implementation one possibility could be to integrate a library. Lucene is a very prominent library in this respect. It is java and I don't think there is a easy way to integrate that. But there is also Lucy (http://incubator.apache.org/lucy/) which is a C library and something like a port of lucene. The description on the web site sounds good:

Apache Lucy is full-text search engine library written in C and targeted at dynamic languages. It is a "loose C" port of Apache Lucene™, a search engine library for Java.

Another question is how you maintain such an index. Back then I had apache solr as an extra server and contacted it via HTTP interface. I played a bit with the change notification in gemstone but I didn't have the impression this is a good way of integrating something like that. Well, having the support and having to maintain it manually would be a good first option.

And I'm wondering that especially in the web business noone seems to have that problem and the web is very search intensive. So I'm interested in all thoughts and remarks.

Norbert

Dale

----- Original Message -----
| From: "Norbert Hartl" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Monday, November 21, 2011 2:43:40 PM
| Subject: [GS/SS Beta] How to do optimized collection selects based on substring search?
|
| I have a bigger collection of users that I want to search by name and
| email. And both can be substring searches. I read again in the
| programming guide but the indexed unordered collections seem only to
| bring benefit if it can be done based on identity or equality.
| What is a proper way to scale with substring searches. Is there
| something in gemstone that might help?
|
| Norbert

Dale Henrichs

Re: How to do optimized collection selects based on substring search?

Norbert,

I wouldn't write-off a GemStone-only solution so quickly ... the indexing structures are designed to minimize impact of searches on the shared page cache so it's possible that you'd get good performance without monkeying with a separate data-base maintained in C ...

If it's not too much work to implement the basic algorithms based on simple collections, I'd be willing to extend the work for GemStone indexes and concurrent updates ... then we could start throwing some big texts at it and see what happens...

Dale

----- Original Message -----
| From: "Norbert Hartl" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Tuesday, November 22, 2011 11:00:37 AM
| Subject: Re: [GS/SS Beta] How to do optimized collection selects based on substring search?
|
|
|
|
| Am 22.11.2011 um 01:07 schrieb Dale Henrichs:
|
|
|
| Norbert,
|
| There's nothing out of the box for substring matching ... I know that
| there are c libraries out there that do this sort of thing and have
| been wondering what a Smalltalk implementation based on GemStone
| would look like, but haven't gotten into the algorithms to see how
| gnarly they might be ...
|
|
|
| Here we go. The world of search indexes is huge and papers are all
| over the place. It really depends on what one thinks is sufficient
| for substring search.
|
|
| - A substring can be finding a word inside a text. This is done
| mostly with an inverted index (or inverted file). Meaning you split
| the text into words and you build an index the has each word
| associated with all things where the word occurs (objects in our
| case). To minimize ambiguity you use stemming to reduce words to
| their base construct in order to be able to match nouns, verbs etc.
| with the same term. Additionally applying stop-words to avoid
| finding to much things when a query contains and, or, the, etc.
| - A substring that is part of a word can be match by using suffix
| arrays. Meaning from a word "smalltalk" you generate the suffixes:
| malltalk, alltalk, lltalk, ltalk, talk, alk, lk and k. This are kept
| in a lexically sorted index so you can use binary seach to find
| them. From this point on you can combine binary search with forward
| substring match on the result set which gives you a true substring
| match. These can be optimized by using n-grams. A 3-gram would give
| you for "smalltalk": s, sm, sma, mal, all, llt, lta, tal, alk, lk
| and k. This can be used to increase speed of lookups as well as
| finding similarities. That opens the fuzzy/approximate can of....
|
|
| Most of the structures are easy to create and less easy to maintain,
| e.g when you have sorted indexes. But my problem with them in
| gemstone is that they great a big amount of extra data that can when
| used fill up the shared page cache really quick. Leaving less space
| for your business objects.
|
|
| Now that gemstone has a working ffi implementation one possibility
| could be to integrate a library. Lucene is a very prominent library
| in this respect. It is java and I don't think there is a easy way to
| integrate that. But there is also Lucy (
| http://incubator.apache.org/lucy/ ) which is a C library and
| something like a port of lucene. The description on the web site
| sounds good:
|
|
| Apache Lucy is full-text search engine library written in C and
| targeted at dynamic languages. It is a "loose C" port of Apache
| Lucene ™, a search engine library for Java.
|
|
| Another question is how you maintain such an index. Back then I had
| apache solr as an extra server and contacted it via HTTP interface.
| I played a bit with the change notification in gemstone but I didn't
| have the impression this is a good way of integrating something like
| that. Well, having the support and having to maintain it manually
| would be a good first option.
|
|
| And I'm wondering that especially in the web business noone seems to
| have that problem and the web is very search intensive. So I'm
| interested in all thoughts and remarks.
|
|
| Norbert
|
|
|
|
| Dale
|
| ----- Original Message -----
| | From: "Norbert Hartl" < [hidden email] >
| | To: "GemStone Seaside beta discussion" < [hidden email]
| | >
| | Sent: Monday, November 21, 2011 2:43:40 PM
| | Subject: [GS/SS Beta] How to do optimized collection selects based
| | on substring search?
| |
| | I have a bigger collection of users that I want to search by name
| | and
| | email. And both can be substring searches. I read again in the
| | programming guide but the indexed unordered collections seem only
| | to
| | bring benefit if it can be done based on identity or equality.
| | What is a proper way to scale with substring searches. Is there
| | something in gemstone that might help?
| |
| | Norbert
|
|

Thelliez

Re: How to do optimized collection selects based on substring search?

Norbert,

We are playing Lucene as well.

Solr provides a REST api to Lucene. This is why I started looking at
the Zinc client to bridge GLASS and Solr. Work in progress on our
side...

Regards,
Thierry

NorbertHartl

Re: How to do optimized collection selects based on substring search?

In reply to this post by Dale Henrichs

Am 22.11.2011 um 20:37 schrieb Dale Henrichs:

> Norbert,
>
> I wouldn't write-off a GemStone-only solution so quickly ... the indexing structures are designed to minimize impact of searches on the shared page cache so it's possible that you'd get good performance without monkeying with a separate data-base maintained in C ...
>
Ok, good idea. Didn't think of the fact that the specialized collections are somehow treated under hood. I think the path based queries have b-tree like indexes attached, right? That is the most important part of building an additional index. And if the traversal is done natively without affecting the shared page cache this is neat. Well, Unions and intersections would be nice, too ;)
And I fully agree that a smalltalk solution is superior to a native attached one and a non-gemstone specific version is best.

> If it's not too much work to implement the basic algorithms based on simple collections, I'd be willing to extend the work for GemStone indexes and concurrent updates ... then we could start throwing some big texts at it and see what happens...
>
I could start easy. I don't have too much expertise in this topic. And I'm really bad in reading academic papers :) I would be really interested to get the gemstone specific part going myself. I'm thinking for some time about how to design code that can be used in pharo and gemstone where the gemstone specific part like specialized collections is not usable in pharo. I probably need you to share some of your black magic wisdom. Is there anything to think about besides path based collection queries and the RC classes?

Norbert

> Dale
>
> ----- Original Message -----
> | From: "Norbert Hartl" <[hidden email]>
> | To: "GemStone Seaside beta discussion" <[hidden email]>
> | Sent: Tuesday, November 22, 2011 11:00:37 AM
> | Subject: Re: [GS/SS Beta] How to do optimized collection selects based on substring search?
> |
> |
> |
> |
> | Am 22.11.2011 um 01:07 schrieb Dale Henrichs:
> |
> |
> |
> | Norbert,
> |
> | There's nothing out of the box for substring matching ... I know that
> | there are c libraries out there that do this sort of thing and have
> | been wondering what a Smalltalk implementation based on GemStone
> | would look like, but haven't gotten into the algorithms to see how
> | gnarly they might be ...
> |
> |
> |
> | Here we go. The world of search indexes is huge and papers are all
> | over the place. It really depends on what one thinks is sufficient
> | for substring search.
> |
> |
> | - A substring can be finding a word inside a text. This is done
> | mostly with an inverted index (or inverted file). Meaning you split
> | the text into words and you build an index the has each word
> | associated with all things where the word occurs (objects in our
> | case). To minimize ambiguity you use stemming to reduce words to
> | their base construct in order to be able to match nouns, verbs etc.
> | with the same term. Additionally applying stop-words to avoid
> | finding to much things when a query contains and, or, the, etc.
> | - A substring that is part of a word can be match by using suffix
> | arrays. Meaning from a word "smalltalk" you generate the suffixes:
> | malltalk, alltalk, lltalk, ltalk, talk, alk, lk and k. This are kept
> | in a lexically sorted index so you can use binary seach to find
> | them. From this point on you can combine binary search with forward
> | substring match on the result set which gives you a true substring
> | match. These can be optimized by using n-grams. A 3-gram would give
> | you for "smalltalk": s, sm, sma, mal, all, llt, lta, tal, alk, lk
> | and k. This can be used to increase speed of lookups as well as
> | finding similarities. That opens the fuzzy/approximate can of....
> |
> |
> | Most of the structures are easy to create and less easy to maintain,
> | e.g when you have sorted indexes. But my problem with them in
> | gemstone is that they great a big amount of extra data that can when
> | used fill up the shared page cache really quick. Leaving less space
> | for your business objects.
> |
> |
> | Now that gemstone has a working ffi implementation one possibility
> | could be to integrate a library. Lucene is a very prominent library
> | in this respect. It is java and I don't think there is a easy way to
> | integrate that. But there is also Lucy (
> | http://incubator.apache.org/lucy/ ) which is a C library and
> | something like a port of lucene. The description on the web site
> | sounds good:
> |
> |
> | Apache Lucy is full-text search engine library written in C and
> | targeted at dynamic languages. It is a "loose C" port of Apache
> | Lucene ™, a search engine library for Java.
> |
> |
> | Another question is how you maintain such an index. Back then I had
> | apache solr as an extra server and contacted it via HTTP interface.
> | I played a bit with the change notification in gemstone but I didn't
> | have the impression this is a good way of integrating something like
> | that. Well, having the support and having to maintain it manually
> | would be a good first option.
> |
> |
> | And I'm wondering that especially in the web business noone seems to
> | have that problem and the web is very search intensive. So I'm
> | interested in all thoughts and remarks.
> |
> |
> | Norbert
> |
> |
> |
> |
> | Dale
> |
> | ----- Original Message -----
> | | From: "Norbert Hartl" < [hidden email] >
> | | To: "GemStone Seaside beta discussion" < [hidden email]
> | | >
> | | Sent: Monday, November 21, 2011 2:43:40 PM
> | | Subject: [GS/SS Beta] How to do optimized collection selects based
> | | on substring search?
> | |
> | | I have a bigger collection of users that I want to search by name
> | | and
> | | email. And both can be substring searches. I read again in the
> | | programming guide but the indexed unordered collections seem only
> | | to
> | | bring benefit if it can be done based on identity or equality.
> | | What is a proper way to scale with substring searches. Is there
> | | something in gemstone that might help?
> | |
> | | Norbert
> |
> |

NorbertHartl

Re: How to do optimized collection selects based on substring search?

In reply to this post by Thelliez

Thierry,

Am 22.11.2011 um 20:55 schrieb Thierry Thelliez:

> Norbert,
>
> We are playing Lucene as well.
>
> Solr provides a REST api to Lucene. This is why I started looking at
> the Zinc client to bridge GLASS and Solr. Work in progress on our
> side...
>
as I wrote in the mail I used solr myself a few years ago. It worked quite ok. If we are talking about full text retrieval we talk about strings. An object that has a few instVars containing strings is a good match to lucene. It is just a lucene document where each field can hold an instvar content. I added an extra field with the objects oop to the document. Having the oop at hand objects can be queried in an easy and quick way when evaluating search results back in gemstone. There are drawbacks in using oop but to me it worked well.
While it is ok to use it it is an extra component which brings extra complexity. And you need to configure the server in order to have your objects properly handled. Maybe that can be done in a dynamic way today. The case I'm trying to solve is to find a user by fullname and email substring search. For this it is definitely overkill. That's why I'm in favour of a smalltalk based solution. If I would have more services that need fulltext retrieval it is a good idea to have this as a central installed component.

Anyway, I'm interested in what you are coming up with. So, please keep me/us updated.

Norbert

Dale Henrichs

Re: How to do optimized collection selects based on substring search?

In reply to this post by NorbertHartl

Norbert,

If I were you and wanted to begin to learn the black arts of indexing, I would start by familiarizing myself with the methods in UnorderedCollection *gemstone-indexing-selections[1]. The methods there allow you to construct queries without having to use the {} query blocks ... the {} query blocks won't compile in Pharo ... it should be possible to implement the various query methods in Pharo as well ...

The best way to familiarize yourself with those methods is to look at the tests in IndexedQueryExtensionsTestCase.

You are going to want to use RcIndentityBag for the collections and create RcEquality indexes on the collections ... this gives you the best chance for managing concurrent updates. Using identity and Strings means that you have to be careful (creative) when adding new entries: two sessions adding the same substring will be problematic ... perhaps a central gem for adding new substrings (that's what we do for symbols) ...

My first thought is to back into the problem: start with the types of queries that need to be written to determine the types of indexed collections you will need to manage ... have to use instance variables for path-based queries[2]. If you can specify all of the queries in terms of EQUALITY then things should be real straightforward. RANGES are relatively straightforward to handle as well. GemStone is pretty good at doing intersections of identity-based collections so the more complex queries ought to leverage this...

I suppose the first rule should have been to just get it working ... use the basic query API without worrying about indexes or performance just get it working and/or identify the places where the query API is awkward or just doesn't fit ... those are the areas where we need to invent new capabilities within the indexing system[3].

I suppose the process would be to break the problem up into a bunch of subsystems and tackle each one on its own. Start with the simple subsystems, solve the problems. Once you've solved the problems for each of the subsystems start putting the subsystems together and solve the problems. Rinse and repeat....

Dale

[1] There is potential for writing stream-based forms of all of those queries
[2] There is potential for adding collection-based indexes to GemStone 3.x
[3] Involves writing new primitives or creating support for [1] (which may involve new primitives)
----- Original Message -----
| From: "Norbert Hartl" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Tuesday, November 22, 2011 2:31:26 PM
| Subject: Re: [GS/SS Beta] How to do optimized collection selects based on substring search?
|
|
| Am 22.11.2011 um 20:37 schrieb Dale Henrichs:
|
| > Norbert,
| >
| > I wouldn't write-off a GemStone-only solution so quickly ... the
| > indexing structures are designed to minimize impact of searches on
| > the shared page cache so it's possible that you'd get good
| > performance without monkeying with a separate data-base maintained
| > in C ...
| >
| Ok, good idea. Didn't think of the fact that the specialized
| collections are somehow treated under hood. I think the path based
| queries have b-tree like indexes attached, right? That is the most
| important part of building an additional index. And if the traversal
| is done natively without affecting the shared page cache this is
| neat. Well, Unions and intersections would be nice, too ;)
| And I fully agree that a smalltalk solution is superior to a native
| attached one and a non-gemstone specific version is best.
|
| > If it's not too much work to implement the basic algorithms based
| > on simple collections, I'd be willing to extend the work for
| > GemStone indexes and concurrent updates ... then we could start
| > throwing some big texts at it and see what happens...
| >
| I could start easy. I don't have too much expertise in this topic.
| And I'm really bad in reading academic papers :) I would be really
| interested to get the gemstone specific part going myself. I'm
| thinking for some time about how to design code that can be used in
| pharo and gemstone where the gemstone specific part like specialized
| collections is not usable in pharo. I probably need you to share
| some of your black magic wisdom. Is there anything to think about
| besides path based collection queries and the RC classes?
|
| Norbert
|
| > Dale
| >
| > ----- Original Message -----
| > | From: "Norbert Hartl" <[hidden email]>
| > | To: "GemStone Seaside beta discussion"
| > | <[hidden email]>
| > | Sent: Tuesday, November 22, 2011 11:00:37 AM
| > | Subject: Re: [GS/SS Beta] How to do optimized collection selects
| > | based on substring search?
| > |
| > |
| > |
| > |
| > | Am 22.11.2011 um 01:07 schrieb Dale Henrichs:
| > |
| > |
| > |
| > | Norbert,
| > |
| > | There's nothing out of the box for substring matching ... I know
| > | that
| > | there are c libraries out there that do this sort of thing and
| > | have
| > | been wondering what a Smalltalk implementation based on GemStone
| > | would look like, but haven't gotten into the algorithms to see
| > | how
| > | gnarly they might be ...
| > |
| > |
| > |
| > | Here we go. The world of search indexes is huge and papers are
| > | all
| > | over the place. It really depends on what one thinks is
| > | sufficient
| > | for substring search.
| > |
| > |
| > | - A substring can be finding a word inside a text. This is done
| > | mostly with an inverted index (or inverted file). Meaning you
| > | split
| > | the text into words and you build an index the has each word
| > | associated with all things where the word occurs (objects in our
| > | case). To minimize ambiguity you use stemming to reduce words to
| > | their base construct in order to be able to match nouns, verbs
| > | etc.
| > | with the same term. Additionally applying stop-words to avoid
| > | finding to much things when a query contains and, or, the, etc.
| > | - A substring that is part of a word can be match by using suffix
| > | arrays. Meaning from a word "smalltalk" you generate the
| > | suffixes:
| > | malltalk, alltalk, lltalk, ltalk, talk, alk, lk and k. This are
| > | kept
| > | in a lexically sorted index so you can use binary seach to find
| > | them. From this point on you can combine binary search with
| > | forward
| > | substring match on the result set which gives you a true
| > | substring
| > | match. These can be optimized by using n-grams. A 3-gram would
| > | give
| > | you for "smalltalk": s, sm, sma, mal, all, llt, lta, tal, alk, lk
| > | and k. This can be used to increase speed of lookups as well as
| > | finding similarities. That opens the fuzzy/approximate can of....
| > |
| > |
| > | Most of the structures are easy to create and less easy to
| > | maintain,
| > | e.g when you have sorted indexes. But my problem with them in
| > | gemstone is that they great a big amount of extra data that can
| > | when
| > | used fill up the shared page cache really quick. Leaving less
| > | space
| > | for your business objects.
| > |
| > |
| > | Now that gemstone has a working ffi implementation one
| > | possibility
| > | could be to integrate a library. Lucene is a very prominent
| > | library
| > | in this respect. It is java and I don't think there is a easy way
| > | to
| > | integrate that. But there is also Lucy (
| > | http://incubator.apache.org/lucy/ ) which is a C library and
| > | something like a port of lucene. The description on the web site
| > | sounds good:
| > |
| > |
| > | Apache Lucy is full-text search engine library written in C and
| > | targeted at dynamic languages. It is a "loose C" port of Apache
| > | Lucene ™, a search engine library for Java.
| > |
| > |
| > | Another question is how you maintain such an index. Back then I
| > | had
| > | apache solr as an extra server and contacted it via HTTP
| > | interface.
| > | I played a bit with the change notification in gemstone but I
| > | didn't
| > | have the impression this is a good way of integrating something
| > | like
| > | that. Well, having the support and having to maintain it manually
| > | would be a good first option.
| > |
| > |
| > | And I'm wondering that especially in the web business noone seems
| > | to
| > | have that problem and the web is very search intensive. So I'm
| > | interested in all thoughts and remarks.
| > |
| > |
| > | Norbert
| > |
| > |
| > |
| > |
| > | Dale
| > |
| > | ----- Original Message -----
| > | | From: "Norbert Hartl" < [hidden email] >
| > | | To: "GemStone Seaside beta discussion" <
| > | | [hidden email]
| > | | >
| > | | Sent: Monday, November 21, 2011 2:43:40 PM
| > | | Subject: [GS/SS Beta] How to do optimized collection selects
| > | | based
| > | | on substring search?
| > | |
| > | | I have a bigger collection of users that I want to search by
| > | | name
| > | | and
| > | | email. And both can be substring searches. I read again in the
| > | | programming guide but the indexed unordered collections seem
| > | | only
| > | | to
| > | | bring benefit if it can be done based on identity or equality.
| > | | What is a proper way to scale with substring searches. Is there
| > | | something in gemstone that might help?
| > | |
| > | | Norbert
| > |
| > |
|
|

James Foster-8

Re: How to do optimized collection selects based on substring search?

> ----- Original Message -----
> | From: "Norbert Hartl" <[hidden email]>
> | To: "GemStone Seaside beta discussion" <[hidden email]>
> | Sent: Tuesday, November 22, 2011 2:31:26 PM
> | Subject: Re: [GS/SS Beta] How to do optimized collection selects based on substring search?
> |
> | Am 22.11.2011 um 20:37 schrieb Dale Henrichs:
> |
> | > Norbert,
> | >
> | > I wouldn't write-off a GemStone-only solution so quickly ... the
> | > indexing structures are designed to minimize impact of searches on
> | > the shared page cache so it's possible that you'd get good
> | > performance without monkeying with a separate data-base maintained
> | > in C ...
> | >
> | Ok, good idea. Didn't think of the fact that the specialized
> | collections are somehow treated under hood. I think the path based
> | queries have b-tree like indexes attached, right? That is the most
> | important part of building an additional index. And if the traversal
> | is done natively without affecting the shared page cache this is
> | neat.

Norbert,

As Dale said, indexes "minimize" the impact on the SPC. The idea that indexing "is done natively without affecting the shared page cache" is not quite right. Indexing structures are still objects (see BtreeNode and subclasses) and they exist on pages. The benefit is that b-tree structures are compact and can be traversed with a minimum number of page reads into the SPC.

James