How to find a string in a large number of strings ...

GLASS mailing list
I want to find a string in a very large number of strings (3 million
and growing).

Should one use a simple Set?

 -> That means a lot of memory is used, and perhaps a lot of RAM is needed
    in the gem (at least 40 MB of data in total). Swapping?

Should I use an UnorderedCollection of Strings with an equality index?

 -> How do I create an index on a set of Strings?

 -> Perhaps like this?
       aSetOfStrings createEqualityIndexOn: '' withLastElementClass: String

Another problem is that this set of strings changes twice a day,
mostly by adding strings.


Any hints?


Marten


--
Marten Feldtmann
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Re: How to find a string in a large number of strings ...

GLASS mailing list
On 03/18/2016 02:34 PM, [hidden email] via Glass wrote:

> I want to find a string in a very large number of strings (3 million
> and growing).
>
> Should one use a simple Set?
>
>  -> That means a lot of memory is used, and perhaps a lot of RAM is needed
>     in the gem (at least 40 MB of data in total). Swapping?
>
> Should I use an UnorderedCollection of Strings with an equality index?
>
>  -> How do I create an index on a set of Strings?
>
>  -> Perhaps like this?
>        aSetOfStrings createEqualityIndexOn: '' withLastElementClass: String
>
> Another problem is that this set of strings changes twice a day,
> mostly by adding strings.
>
>
> Any hints?

What is the key by which you look up the string?

If it is the entire string (the string is 'foobar' and I know I want
'foobar'), use a Set; this will be very efficient (but if you know the
entire string, why do you need to look it up at all?).

If it is a prefix of the string (the string is 'foobar' but I only know
I want the string that starts with 'foo'), use an index.

If you want to look up by some substring (the string is 'foobar' and I
know only that I want a string that contains 'oba'), you might want to
build a more complex structure.
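
For the exact-match case, a minimal sketch might look like the following. The variable name and the sample strings are made up for illustration; the sketch uses only the standard collection protocol (`Set new`, `add:`, `addAll:`, `includes:`) and says nothing about how the set would be made persistent.

```smalltalk
| knownStrings |
"Hypothetical sample data; the real collection would hold millions of strings."
knownStrings := Set new.
knownStrings add: 'foobar'.
knownStrings addAll: #('foo' 'bar' 'baz').

"Exact-match membership test: the argument is hashed and compared with =."
knownStrings includes: 'foobar'.   "true"
knownStrings includes: 'oba'.      "false, a Set does not match substrings"
```

Because `includes:` matches whole elements only, the prefix and substring cases need a different structure, as described above.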

Regards,

-Martin

Re: How to find a string in a large number of strings ...

GLASS mailing list
On 18.03.2016 at 23:31, Martin McClure wrote:

>
> If it is the entire string, (the string is 'foobar' and I know I want
> 'foobar') use a Set, this will be very efficient (but if you know the
> entire string, why do you need to look it up at all?)

 The only information I need is whether the string is present in that
set; that's all. I think I will start with a Set, though I thought it
would be a waste of memory to have the whole thing loaded into gem
memory, since set operations are memory based ...

--
Marten Feldtmann

Re: How to find a string in a large number of strings ...

GLASS mailing list
On 03/19/2016 12:11 AM, [hidden email] wrote:

> On 18.03.2016 at 23:31, Martin McClure wrote:
>
>>
>> If it is the entire string, (the string is 'foobar' and I know I want
>> 'foobar') use a Set, this will be very efficient (but if you know the
>> entire string, why do you need to look it up at all?)
>
>   The only information I need is whether the string is present in that
> set; that's all. I think I will start with a Set, though I thought it
> would be a waste of memory to have the whole thing loaded into gem
> memory, since set operations are memory based ...
>

It sounds like a Set is ideal, then.

Memory should not be a problem when doing lookups, even with very large
sets. When doing a lookup, first the hash of the string to be looked up
is calculated. This indicates where in the set the string will be, if it
is present. Then only a small portion of the set, including that
position, is faulted into memory and the lookup completed. The entire
set does not ever need to be in memory at once.
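
The principle can be illustrated with a toy hashed collection. This is only a sketch of the general idea, not GemStone's actual Set implementation; the bucket count, the block names, and the use of an Array of OrderedCollections are all assumptions made for the example.

```smalltalk
| bucketCount buckets indexFor addString includesString |
bucketCount := 8.
buckets := Array new: bucketCount.
1 to: bucketCount do: [:i | buckets at: i put: OrderedCollection new].

"The hash alone decides which single bucket can hold aString."
indexFor := [:aString | (aString hash \\ bucketCount) + 1].

addString := [:aString | | bucket |
    bucket := buckets at: (indexFor value: aString).
    (bucket includes: aString) ifFalse: [bucket add: aString]].

"Lookup inspects only that one bucket, never the other seven."
includesString := [:aString |
    (buckets at: (indexFor value: aString)) includes: aString].

addString value: 'foobar'.
includesString value: 'foobar'.   "true"
includesString value: 'foo'.      "false"
```

In a persistent set the buckets would correspond to the portions of the collection that get faulted in on demand, so a lookup touches only the part the hash points to.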

Regards,

-Martin