Günther Schmidt wrote:
> I have records of books (from a catalog system), and they've got keyword
> fields that contain 1-, 2- and 3-word keywords like:
>
> '4th century history Europe revolutionary history of Germany feminist
> movement'
>
> The keywords in the record above should then be:
>
> '4th century history'
> 'Europe'
> 'revolutionary history of Germany'
> 'feminist movement'
One idea would be to measure how strongly associated each pair of adjacent
words is, e.g. if 'feminist' is usually followed by 'history' then that pair is
part of a multi-word keyword, whereas 'feminist' is rarely followed by
'giraffe', so those two are not part of the same keyword.
Then the parsing would be:

    get next word
    does it "usually" follow the previous word (> some threshold)?
        ifTrue:  still in same keyword sequence
        ifFalse: finish previous sequence and start new one

The hardest part would be to find a suitable threshold. I suspect that if you
want to be formal about it then you'll have to get into (Bayesian?) statistics,
but I'd hope[*] that's not necessary.
([*] because I /loathe/ stats, and am very bad at it ;-)
-- chris
Simple implementation -- untested:
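    "follows maps each pair of adjacent words to the number of times it occurs.
     all was never initialised in the first draft; a Bag is assumed here, so it
     also counts how often each single word occurs."
    follows := LookupTable new.
    all := Bag new.
    records do:
        [:record || words |
        words := record subStrings.
        all addAll: words.
        1 to: words size-1 do:
            [:i || pair count |
            pair := Array with: (words at: i) with: (words at: i+1).
            count := follows at: pair ifAbsentPut: [0].
            follows at: pair put: count+1]].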
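
The snippet above only gathers the counts. Here is a rough, equally untested sketch of the parsing step as one self-contained workspace snippet. Everything in it beyond the basic idea is an assumption for illustration: the toy records, the names firsts and segment, the choice of "fraction of times the previous word is followed by this one" as the association measure, and the 3/4 threshold are all made up. It uses Dictionary and a space-joined string key rather than LookupTable and an Array so that it does not depend on one dialect's collection classes, and it assumes subStrings splits on whitespace, as in the code above.

```smalltalk
| records follows firsts threshold segment |

"Toy data, invented for illustration: the keyword field of each record."
records := #('feminist movement Europe'
             'Europe feminist movement'
             'Europe medieval history'
             'medieval history feminist movement'
             'feminist movement medieval history'
             'medieval history Europe').

follows := Dictionary new.  "'w1 w2' -> how often w2 directly follows w1"
firsts := Bag new.          "how often each word is followed by any word"
records do: [:record | | words |
    words := record subStrings.
    1 to: words size - 1 do: [:i | | pair |
        pair := (words at: i), ' ', (words at: i + 1).
        follows at: pair put: (follows at: pair ifAbsent: [0]) + 1.
        firsts add: (words at: i)]].

"Arbitrary cut-off (an assumption); it would need tuning on real data."
threshold := 3/4.

"Split one record into keywords: keep extending the current keyword while
 the next word follows the previous one often enough, else start a new one."
segment := [:record | | keywords current previous |
    keywords := OrderedCollection new.
    current := nil.
    previous := nil.
    record subStrings do: [:word | | seen ratio |
        ratio := 0.
        previous isNil ifFalse: [
            seen := firsts occurrencesOf: previous.
            seen > 0 ifTrue: [
                ratio := (follows at: previous, ' ', word ifAbsent: [0]) / seen]].
        (current notNil and: [ratio >= threshold])
            ifTrue: [current := current, ' ', word]
            ifFalse: [
                current isNil ifFalse: [keywords add: current].
                current := word].
        previous := word].
    current isNil ifFalse: [keywords add: current].
    keywords].

segment value: 'Europe feminist movement medieval history'
```

With the toy data, 'feminist' is always followed by 'movement' and 'medieval' by 'history', while every other pair occurs only half the time, so the last expression should answer the three keywords 'Europe', 'feminist movement' and 'medieval history'. On a real catalogue the ratios will be much noisier, and a word that only ever occurs once will look perfectly associated with whatever follows it, so some minimum count would probably be needed as well.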