Levenstein Distance

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Levenstein Distance

Fernando Rodríguez
Hi,

Before I go and reinvent the wheel, is there any implementation out
there of the editing or Levenstein distance between 2 strings?

Thanks


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Chris Uppal-3
Fernando,

> Before I go and reinvent the wheel, is there any implementation out
> there of the editing or Levenstein distance between 2 strings?

See the C.L.S.D thread entitled "Fuzzy string comparison" dated around 10th May
2001.  Google has it.

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

seani-2
In reply to this post by Fernando Rodríguez
On Tue, 19 Jul 2005 07:21:21 -0700, Fernando wrote:

> Hi,
>
> Before I go and reinvent the wheel, is there any implementation out
> there of the editing or Levenstein distance between 2 strings?
>
> Thanks

I believe it's "Levenshtein". If only you had some way of fuzzy matching,
you'd probably find this page :-)

http://www.merriampark.com/ld.htm

No ST code, but nice clear implementations in other common languages.


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Torsten Bergmann-2
> > Before I go and reinvent the wheel, is there any implementation out
> > there of the editing or Levenstein distance between 2 strings?

> > Before I go and reinvent the wheel, is there any implementation out
> > there of the editing or Levenstein distance between 2 strings?

Hi,

http://www.beta4.com/mc/Fuzz-avi.9.mcz includes a Smalltalk
implementation for Levenshtein Distance. You can load it into Squeak
using the file list or adding http://www.beta4.com/mc/ as a Monticello
HTTP repository. If you dont have Squeak available just open the file
with your favourite ZIP program.

There is an extension on String called #levenshteinDistanceFrom:.
It was implemented by Avi Bryant (author of the Seaside Smalltalk web
framework).

Note that the code requires the class Array2D which can be found in the
SqueakMap repository (direct link to latest version is
http://map1.squeakfoundation.org/sm/accountbyid/a6930213-b578-49b1-862e-228cc5ab39e7/files/Array2D-md.1.mcz)

(in Squeak just open the package loader and load the Array2D package)

Bye
Torsten


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Avi  Bryant
Astares wrote:

> > > Before I go and reinvent the wheel, is there any implementation out
> > > there of the editing or Levenstein distance between 2 strings?
>
> > > Before I go and reinvent the wheel, is there any implementation out
> > > there of the editing or Levenstein distance between 2 strings?
>
> Hi,
>
> http://www.beta4.com/mc/Fuzz-avi.9.mcz includes a Smalltalk
> implementation for Levenshtein Distance.

If speed isn't too much of an issue, I've found the Ratcliff-Obershelp
implementation in that package to be the most useful for doing fuzzy
string matching in practice.

Also potentially interesting is the ternary search tree (TSTree)
implementation in my BTree package on SqueakMap, which acts as a
dictionary mapping strings to values, but can do lookups of "similar"
keys (to a specified distance) as well as prefix and exact matches.

They're useful in combination: I use the TSTree's prefix and similarity
matching to narrow the field, and then RatcliffObershelp to refine and
sort the results.

None of this should be very difficult to port to Smalltalks other than
Squeak.

Avi


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Udo Schneider
In reply to this post by Fernando Rodríguez
Fernando wrote:
> Before I go and reinvent the wheel, is there any implementation out
> there of the editing or Levenstein distance between 2 strings?
The following code is neither nice nor fast ... but it works ;-)

CU,

Udo



-------------------------------------------------------------------------------------------------------
String>>levenshteinDistance: aString
        | matrix colChar rowChar cost stream |
        self isEmpty ifTrue: [^aString size].
        aString isEmpty ifTrue: [^self size].
        matrix := Dictionary new.
        0 to: self size do: [:col | 0 to: aString size do: [:row | matrix at:
col @ row put: 0]].
        0 to: self size do: [:col | matrix at: col @ 0 put: col].
        0 to: aString size do: [:row | matrix at: 0 @ row put: row].
        1 to: self size
                do:
                        [:col |
                        colChar := self at: col.
                        1 to: aString size
                                do:
                                        [:row |
                                        rowChar := aString at: row.
                                        colChar = rowChar ifTrue: [cost := 0] ifFalse: [cost := 1].
                                        matrix at: col @ row
                                                put: ((matrix at: (col - 1) @ row) + 1 min: (matrix at: col @ (row
- 1)) + 1
                                                                min: (matrix at: (col - 1) @ (row - 1)) + cost)]].
^matrix at: (self size)@(aString size)


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Thomas Gagne-2
In reply to this post by Fernando Rodríguez
Smalltalk/X has an implementation.  You can download it at
<http://www.exept.de/>.

Fernando wrote:
> Hi,
>
> Before I go and reinvent the wheel, is there any implementation out
> there of the editing or Levenstein distance between 2 strings?
>
> Thanks
>


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Eliot Miranda
In reply to this post by Fernando Rodríguez
Here's a version of the code with a few minor optimizations and split
into a case-sensitive and case-insensitive pair...  For VW 7.x

Fernando wrote:
> Hi,
>
> Before I go and reinvent the wheel, is there any implementation out
> there of the editing or Levenstein distance between 2 strings?
>
> Thanks
>

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd

<?xml version="1.0"?>

<st-source>
<time-stamp>From VisualWorks®, 7.3.1 of April 20, 2005 on July 22, 2005 at 10:38:28 am</time-stamp>


<methods>
<class-id>Core.CharacterArray</class-id> <category>comparing</category>

<body package="Collections-Text" selector="exactCaseDistanceFrom:">exactCaseDistanceFrom: comparand
        "Answer a Levenshtein distance between the receiver and some string.
         Given two strings, self and comparand, and four operations, adding,
         subtracting, exchanging or changing the case of single characters,
         answer the minimal number of steps needed to translate self into comparand.
                Algorithm description: http://www.merriampark.com/ld.htm.
         Unlike distanceFrom: this version does not ignore case differences."

        | matrix size compSize |
        size := self size.
        compSize := comparand size.
        matrix := Array new: size + 1.
        1 to: size + 1 do: [:i | | a |
                a := Array new: compSize + 1.
                a at: 1 put: i - 1.
                matrix at: i put: a].
        0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i].
        1 to: size do: [:i |
                1 to: compSize do: [:j | | x y z cs cc |
                                x := ((matrix at: i) at: j + 1) + 1.
                                y := ((matrix at: i + 1) at: j) + 1.
                                "z := (self at: i) = (comparand at: j)
                                                ifTrue: [(matrix at: i) at: j]
                                                ifFalse: [((matrix at: i) at: j) + 1]."
                                cs := self at: i.
                                cc := comparand at: j.
                                z := (matrix at: i) at: j.
                                cs ~~ cc ifTrue:
                                        [z := z + (cs = cc
                                                                ifTrue: [0.25]
                                                                ifFalse: [1])].
                                (matrix at: i + 1)
                                        at: j + 1 put: ((x min: y) min: z)]].
        ^(matrix at: size + 1)
                at: compSize + 1


"#('zork' 'very long string' 'extremely long string if you squint at it') collect:
        [:v| | r vs |
        vs := v size asFloat.
        (Time millisecondsToRun:
                [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
                        [:assoc :sym| | ss distance |
                        ((ss := sym size) &gt; 0
                        and: [vs / ss asFloat between: 0.5 and: 1.5])
                                ifTrue:
                                        [distance := sym distance: v.
                                         assoc key &lt; distance
                                                ifTrue: [assoc]
                                                ifFalse: [distance -&gt; sym]]
                                ifFalse: [assoc]]]) -&gt; r]"


"#('zork' 'very long string' 'extremely long string if you squint at it') collect:
        [:v| | r s vs |
        vs := v size.
        (Time millisecondsToRun:
                [s := Symbol allGeneralInstances inject: 0 -&gt; nil into:
                        [:assoc :sym| | score |
                        (score := sym spellAgainst: v) &gt; assoc key
                                ifTrue:
                                        [score -&gt; sym]
                                ifFalse: [assoc]]]) -&gt; s -&gt;
        (Time millisecondsToRun:
                [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
                        [:assoc :sym| | ss distance |
                        ((ss := sym size) &gt; 0
                        and: [ss between: vs - 3 and: vs + 3])
                                ifTrue:
                                        [distance := sym distance: v.
                                         assoc key &lt; distance
                                                ifTrue: [assoc]
                                                ifFalse: [distance -&gt; sym]]
                                ifFalse: [assoc]]]) -&gt; r]"

"'zork' distance: 'fork'"
"'Fork' distance: 'fork'"</body>
</methods>

<methods>
<class-id>Core.CharacterArray</class-id> <category>comparing</category>

<body package="Collections-Text" selector="distanceFrom:">distanceFrom: comparand
        "Answer a Levenshtein distance between the receiver and some string.
         Given two strings, self and comparand, and four operations, adding,
         subtracting, exchanging or changing the case of single characters,
         answer the minimal number of steps needed to translate self into comparand.
                Algorithm description: http://www.merriampark.com/ld.htm"

        | matrix size compSize |
        size := self size.
        compSize := comparand size.
        matrix := Array new: size + 1.
        1 to: size + 1 do: [:i | | a |
                a := Array new: compSize + 1.
                a at: 1 put: i - 1.
                matrix at: i put: a].
        0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i].
        1 to: size do: [:i |
                1 to: compSize do: [:j | | x y z cs cc |
                                x := ((matrix at: i) at: j + 1) + 1.
                                y := ((matrix at: i + 1) at: j) + 1.
                                "z := (self at: i) = (comparand at: j)
                                                ifTrue: [(matrix at: i) at: j]
                                                ifFalse: [((matrix at: i) at: j) + 1]."
                                cs := self at: i.
                                cc := comparand at: j.
                                z := (matrix at: i) at: j.
                                cs ~~ cc ifTrue:
                                        [z := z + (cs asLowercase = cc asLowercase
                                                                ifTrue: [0.25]
                                                                ifFalse: [1])].
                                (matrix at: i + 1)
                                        at: j + 1 put: ((x min: y) min: z)]].
        ^(matrix at: size + 1)
                at: compSize + 1


"#('zork' 'very long string' 'extremely long string if you squint at it') collect:
        [:v| | r vs |
        vs := v size asFloat.
        (Time millisecondsToRun:
                [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
                        [:assoc :sym| | ss distance |
                        ((ss := sym size) &gt; 0
                        and: [vs / ss asFloat between: 0.5 and: 1.5])
                                ifTrue:
                                        [distance := sym distance: v.
                                         assoc key &lt; distance
                                                ifTrue: [assoc]
                                                ifFalse: [distance -&gt; sym]]
                                ifFalse: [assoc]]]) -&gt; r]"


"#('zork' 'very long string' 'extremely long string if you squint at it') collect:
        [:v| | r s vs |
        vs := v size.
        (Time millisecondsToRun:
                [s := Symbol allGeneralInstances inject: 0 -&gt; nil into:
                        [:assoc :sym| | score |
                        (score := sym spellAgainst: v) &gt; assoc key
                                ifTrue:
                                        [score -&gt; sym]
                                ifFalse: [assoc]]]) -&gt; s -&gt;
        (Time millisecondsToRun:
                [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
                        [:assoc :sym| | ss distance |
                        ((ss := sym size) &gt; 0
                        and: [ss between: vs - 3 and: vs + 3])
                                ifTrue:
                                        [distance := sym distance: v.
                                         assoc key &lt; distance
                                                ifTrue: [assoc]
                                                ifFalse: [distance -&gt; sym]]
                                ifFalse: [assoc]]]) -&gt; r]"

"'zork' distance: 'fork'"
"'Fork' distance: 'fork'"</body>
</methods>

</st-source>
Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Vassili Bykov-4
The problem with direct transcriptions of the algorithm is that they are
O(n^2) space, while it's not that hard to do it in O(n) space. You only
even use the current and the previous rows of the matrix.

Also, there's is no reason why in a language as nice as Smalltalk the
operation could not be part of the generic SequenceableCollection
protocol, allow any selector as comparison and allow to assign different
costs to insertion, deletion and substitution.

See 'Levenshtein Distance' in the public Store repository, for VW 7.x

--Vassili


Eliot Miranda wrote:

> Here's a version of the code with a few minor optimizations and split
> into a case-sensitive and case-insensitive pair...  For VW 7.x
>
> Fernando wrote:
>
>> Hi,
>>
>> Before I go and reinvent the wheel, is there any implementation out
>> there of the editing or Levenstein distance between 2 strings?
>>
>> Thanks
>>
>
>
> ------------------------------------------------------------------------
>
> <?xml version="1.0"?>
>
> <st-source>
> <time-stamp>From VisualWorks®, 7.3.1 of April 20, 2005 on July 22, 2005 at 10:38:28 am</time-stamp>
>
>
> <methods>
> <class-id>Core.CharacterArray</class-id> <category>comparing</category>
>
> <body package="Collections-Text" selector="exactCaseDistanceFrom:">exactCaseDistanceFrom: comparand
> "Answer a Levenshtein distance between the receiver and some string.
> Given two strings, self and comparand, and four operations, adding,
> subtracting, exchanging or changing the case of single characters,
> answer the minimal number of steps needed to translate self into comparand.
> Algorithm description: http://www.merriampark.com/ld.htm.
> Unlike distanceFrom: this version does not ignore case differences."
>
> | matrix size compSize |
> size := self size.
> compSize := comparand size.
> matrix := Array new: size + 1.
> 1 to: size + 1 do: [:i | | a |
> a := Array new: compSize + 1.
> a at: 1 put: i - 1.
> matrix at: i put: a].
> 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i].
> 1 to: size do: [:i |
> 1 to: compSize do: [:j | | x y z cs cc |
> x := ((matrix at: i) at: j + 1) + 1.
> y := ((matrix at: i + 1) at: j) + 1.
> "z := (self at: i) = (comparand at: j)
> ifTrue: [(matrix at: i) at: j]
> ifFalse: [((matrix at: i) at: j) + 1]."
> cs := self at: i.
> cc := comparand at: j.
> z := (matrix at: i) at: j.
> cs ~~ cc ifTrue:
> [z := z + (cs = cc
> ifTrue: [0.25]
> ifFalse: [1])].
> (matrix at: i + 1)
> at: j + 1 put: ((x min: y) min: z)]].
> ^(matrix at: size + 1)
> at: compSize + 1
>
>
> "#('zork' 'very long string' 'extremely long string if you squint at it') collect:
> [:v| | r vs |
> vs := v size asFloat.
> (Time millisecondsToRun:
> [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
> [:assoc :sym| | ss distance |
> ((ss := sym size) &gt; 0
> and: [vs / ss asFloat between: 0.5 and: 1.5])
> ifTrue:
> [distance := sym distance: v.
> assoc key &lt; distance
> ifTrue: [assoc]
> ifFalse: [distance -&gt; sym]]
> ifFalse: [assoc]]]) -&gt; r]"
>
>
> "#('zork' 'very long string' 'extremely long string if you squint at it') collect:
> [:v| | r s vs |
> vs := v size.
> (Time millisecondsToRun:
> [s := Symbol allGeneralInstances inject: 0 -&gt; nil into:
> [:assoc :sym| | score |
> (score := sym spellAgainst: v) &gt; assoc key
> ifTrue:
> [score -&gt; sym]
> ifFalse: [assoc]]]) -&gt; s -&gt;
> (Time millisecondsToRun:
> [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
> [:assoc :sym| | ss distance |
> ((ss := sym size) &gt; 0
> and: [ss between: vs - 3 and: vs + 3])
> ifTrue:
> [distance := sym distance: v.
> assoc key &lt; distance
> ifTrue: [assoc]
> ifFalse: [distance -&gt; sym]]
> ifFalse: [assoc]]]) -&gt; r]"
>
> "'zork' distance: 'fork'"
> "'Fork' distance: 'fork'"</body>
> </methods>
>
> <methods>
> <class-id>Core.CharacterArray</class-id> <category>comparing</category>
>
> <body package="Collections-Text" selector="distanceFrom:">distanceFrom: comparand
> "Answer a Levenshtein distance between the receiver and some string.
> Given two strings, self and comparand, and four operations, adding,
> subtracting, exchanging or changing the case of single characters,
> answer the minimal number of steps needed to translate self into comparand.
> Algorithm description: http://www.merriampark.com/ld.htm"
>
> | matrix size compSize |
> size := self size.
> compSize := comparand size.
> matrix := Array new: size + 1.
> 1 to: size + 1 do: [:i | | a |
> a := Array new: compSize + 1.
> a at: 1 put: i - 1.
> matrix at: i put: a].
> 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i].
> 1 to: size do: [:i |
> 1 to: compSize do: [:j | | x y z cs cc |
> x := ((matrix at: i) at: j + 1) + 1.
> y := ((matrix at: i + 1) at: j) + 1.
> "z := (self at: i) = (comparand at: j)
> ifTrue: [(matrix at: i) at: j]
> ifFalse: [((matrix at: i) at: j) + 1]."
> cs := self at: i.
> cc := comparand at: j.
> z := (matrix at: i) at: j.
> cs ~~ cc ifTrue:
> [z := z + (cs asLowercase = cc asLowercase
> ifTrue: [0.25]
> ifFalse: [1])].
> (matrix at: i + 1)
> at: j + 1 put: ((x min: y) min: z)]].
> ^(matrix at: size + 1)
> at: compSize + 1
>
>
> "#('zork' 'very long string' 'extremely long string if you squint at it') collect:
> [:v| | r vs |
> vs := v size asFloat.
> (Time millisecondsToRun:
> [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
> [:assoc :sym| | ss distance |
> ((ss := sym size) &gt; 0
> and: [vs / ss asFloat between: 0.5 and: 1.5])
> ifTrue:
> [distance := sym distance: v.
> assoc key &lt; distance
> ifTrue: [assoc]
> ifFalse: [distance -&gt; sym]]
> ifFalse: [assoc]]]) -&gt; r]"
>
>
> "#('zork' 'very long string' 'extremely long string if you squint at it') collect:
> [:v| | r s vs |
> vs := v size.
> (Time millisecondsToRun:
> [s := Symbol allGeneralInstances inject: 0 -&gt; nil into:
> [:assoc :sym| | score |
> (score := sym spellAgainst: v) &gt; assoc key
> ifTrue:
> [score -&gt; sym]
> ifFalse: [assoc]]]) -&gt; s -&gt;
> (Time millisecondsToRun:
> [r := Symbol allGeneralInstances inject: SmallInteger maxVal -&gt; nil into:
> [:assoc :sym| | ss distance |
> ((ss := sym size) &gt; 0
> and: [ss between: vs - 3 and: vs + 3])
> ifTrue:
> [distance := sym distance: v.
> assoc key &lt; distance
> ifTrue: [assoc]
> ifFalse: [distance -&gt; sym]]
> ifFalse: [assoc]]]) -&gt; r]"
>
> "'zork' distance: 'fork'"
> "'Fork' distance: 'fork'"</body>
> </methods>
>
> </st-source>


--
Vassili Bykov
VisualWorks Engineering, Tools Technical Lead
v b y k o v  A T  c i n c o m  D O T  c o m
http://www.cincomsmalltalk.com/userblogs/vbykov


Reply | Threaded
Open this post in threaded view
|

Re: Levenstein Distance

Reinout Heeck
In reply to this post by Fernando Rodríguez
I just came across the work about computing a more general 'information
distance' by concatenating two strings and using standard compression
algorithms to compress the result. The size of the compressed string is
used to compute an information distance.

The applications blew my mind. This is *domain independent* code that can
apparently extract a lot of domain specific 'meaning', particularly when
combined with Google.


http://program.whatthehack.org/event/101.en.html

http://complearn.org/


This is probably also interesting as a support tool for the people involved
in Formal Concept Analysis (was that you Stef or Roel ?)



R
-