Hi,
Before I go and reinvent the wheel, is there any implementation out there of the editing or Levenstein distance between 2 strings? Thanks |
Fernando,
> Before I go and reinvent the wheel, is there any implementation out > there of the editing or Levenstein distance between 2 strings? See the C.L.S.D thread entitled "Fuzzy string comparison" dated around 10th May 2001. Google has it. -- chris |
In reply to this post by Fernando Rodríguez
On Tue, 19 Jul 2005 07:21:21 -0700, Fernando wrote:
> Hi, > > Before I go and reinvent the wheel, is there any implementation out > there of the editing or Levenstein distance between 2 strings? > > Thanks I believe it's "Levenshtein". If only you had some way of fuzzy matching, you'd probably find this page :-) http://www.merriampark.com/ld.htm No ST code, but nice clear implementations in other common languages. |
> > Before I go and reinvent the wheel, is there any implementation out
> > there of the editing or Levenstein distance between 2 strings? > > Before I go and reinvent the wheel, is there any implementation out > > there of the editing or Levenstein distance between 2 strings? Hi, http://www.beta4.com/mc/Fuzz-avi.9.mcz includes a Smalltalk implementation for Levenshtein Distance. You can load it into Squeak using the file list or adding http://www.beta4.com/mc/ as a Monticello HTTP repository. If you dont have Squeak available just open the file with your favourite ZIP program. There is an extension on String called #levenshteinDistanceFrom:. It was implemented by Avi Bryant (author of the Seaside Smalltalk web framework). Note that the code requires the class Array2D which can be found in the SqueakMap repository (direct link to latest version is http://map1.squeakfoundation.org/sm/accountbyid/a6930213-b578-49b1-862e-228cc5ab39e7/files/Array2D-md.1.mcz) (in Squeak just open the package loader and load the Array2D package) Bye Torsten |
Astares wrote:
> > > Before I go and reinvent the wheel, is there any implementation out > > > there of the editing or Levenstein distance between 2 strings? > > > > Before I go and reinvent the wheel, is there any implementation out > > > there of the editing or Levenstein distance between 2 strings? > > Hi, > > http://www.beta4.com/mc/Fuzz-avi.9.mcz includes a Smalltalk > implementation for Levenshtein Distance. If speed isn't too much of an issue, I've found the Ratcliff-Obershelp implementation in that package to be the most useful for doing fuzzy string matching in practice. Also potentially interesting is the ternary search tree (TSTree) implementation in my BTree package on SqueakMap, which acts as a dictionary mapping strings to values, but can do lookups of "similar" keys (to a specified distance) as well as prefix and exact matches. They're useful in combination: I use the TSTree's prefix and similarity matching to narrow the field, and then RatcliffObershelp to refine and sort the results. None of this should be very difficult to port to Smalltalks other than Squeak. Avi |
In reply to this post by Fernando Rodríguez
Fernando wrote:
> Before I go and reinvent the wheel, is there any implementation out > there of the editing or Levenstein distance between 2 strings? The following code is neither nice nor fast ... but it works ;-) CU, Udo ------------------------------------------------------------------------------------------------------- String>>levenshteinDistance: aString | matrix colChar rowChar cost stream | self isEmpty ifTrue: [^aString size]. aString isEmpty ifTrue: [^self size]. matrix := Dictionary new. 0 to: self size do: [:col | 0 to: aString size do: [:row | matrix at: col @ row put: 0]]. 0 to: self size do: [:col | matrix at: col @ 0 put: col]. 0 to: aString size do: [:row | matrix at: 0 @ row put: row]. 1 to: self size do: [:col | colChar := self at: col. 1 to: aString size do: [:row | rowChar := aString at: row. colChar = rowChar ifTrue: [cost := 0] ifFalse: [cost := 1]. matrix at: col @ row put: ((matrix at: (col - 1) @ row) + 1 min: (matrix at: col @ (row - 1)) + 1 min: (matrix at: (col - 1) @ (row - 1)) + cost)]]. ^matrix at: (self size)@(aString size) |
In reply to this post by Fernando Rodríguez
Smalltalk/X has an implementation. You can download it at
<http://www.exept.de/>. Fernando wrote: > Hi, > > Before I go and reinvent the wheel, is there any implementation out > there of the editing or Levenstein distance between 2 strings? > > Thanks > |
In reply to this post by Fernando Rodríguez
Here's a version of the code with a few minor optimizations and split
into a case-sensitive and case-insensitive pair... For VW 7.x Fernando wrote: > Hi, > > Before I go and reinvent the wheel, is there any implementation out > there of the editing or Levenstein distance between 2 strings? > > Thanks > -- _______________,,,^..^,,,____________________________ Eliot Miranda Smalltalk - Scene not herd <?xml version="1.0"?> <st-source> <time-stamp>From VisualWorks®, 7.3.1 of April 20, 2005 on July 22, 2005 at 10:38:28 am</time-stamp> <methods> <class-id>Core.CharacterArray</class-id> <category>comparing</category> <body package="Collections-Text" selector="exactCaseDistanceFrom:">exactCaseDistanceFrom: comparand "Answer a Levenshtein distance between the receiver and some string. Given two strings, self and comparand, and four operations, adding, subtracting, exchanging or changing the case of single characters, answer the minimal number of steps needed to translate self into comparand. Algorithm description: http://www.merriampark.com/ld.htm. Unlike distanceFrom: this version does not ignore case differences." | matrix size compSize | size := self size. compSize := comparand size. matrix := Array new: size + 1. 1 to: size + 1 do: [:i | | a | a := Array new: compSize + 1. a at: 1 put: i - 1. matrix at: i put: a]. 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i]. 1 to: size do: [:i | 1 to: compSize do: [:j | | x y z cs cc | x := ((matrix at: i) at: j + 1) + 1. y := ((matrix at: i + 1) at: j) + 1. "z := (self at: i) = (comparand at: j) ifTrue: [(matrix at: i) at: j] ifFalse: [((matrix at: i) at: j) + 1]." cs := self at: i. cc := comparand at: j. z := (matrix at: i) at: j. cs ~~ cc ifTrue: [z := z + (cs = cc ifTrue: [0.25] ifFalse: [1])]. (matrix at: i + 1) at: j + 1 put: ((x min: y) min: z)]]. ^(matrix at: size + 1) at: compSize + 1 "#('zork' 'very long string' 'extremely long string if you squint at it') collect: [:v| | r vs | vs := v size asFloat. (Time millisecondsToRun: [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: [:assoc :sym| | ss distance | ((ss := sym size) > 0 and: [vs / ss asFloat between: 0.5 and: 1.5]) ifTrue: [distance := sym distance: v. assoc key < distance ifTrue: [assoc] ifFalse: [distance -> sym]] ifFalse: [assoc]]]) -> r]" "#('zork' 'very long string' 'extremely long string if you squint at it') collect: [:v| | r s vs | vs := v size. (Time millisecondsToRun: [s := Symbol allGeneralInstances inject: 0 -> nil into: [:assoc :sym| | score | (score := sym spellAgainst: v) > assoc key ifTrue: [score -> sym] ifFalse: [assoc]]]) -> s -> (Time millisecondsToRun: [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: [:assoc :sym| | ss distance | ((ss := sym size) > 0 and: [ss between: vs - 3 and: vs + 3]) ifTrue: [distance := sym distance: v. assoc key < distance ifTrue: [assoc] ifFalse: [distance -> sym]] ifFalse: [assoc]]]) -> r]" "'zork' distance: 'fork'" "'Fork' distance: 'fork'"</body> </methods> <methods> <class-id>Core.CharacterArray</class-id> <category>comparing</category> <body package="Collections-Text" selector="distanceFrom:">distanceFrom: comparand "Answer a Levenshtein distance between the receiver and some string. Given two strings, self and comparand, and four operations, adding, subtracting, exchanging or changing the case of single characters, answer the minimal number of steps needed to translate self into comparand. Algorithm description: http://www.merriampark.com/ld.htm" | matrix size compSize | size := self size. compSize := comparand size. matrix := Array new: size + 1. 1 to: size + 1 do: [:i | | a | a := Array new: compSize + 1. a at: 1 put: i - 1. matrix at: i put: a]. 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i]. 1 to: size do: [:i | 1 to: compSize do: [:j | | x y z cs cc | x := ((matrix at: i) at: j + 1) + 1. y := ((matrix at: i + 1) at: j) + 1. "z := (self at: i) = (comparand at: j) ifTrue: [(matrix at: i) at: j] ifFalse: [((matrix at: i) at: j) + 1]." cs := self at: i. cc := comparand at: j. z := (matrix at: i) at: j. cs ~~ cc ifTrue: [z := z + (cs asLowercase = cc asLowercase ifTrue: [0.25] ifFalse: [1])]. (matrix at: i + 1) at: j + 1 put: ((x min: y) min: z)]]. ^(matrix at: size + 1) at: compSize + 1 "#('zork' 'very long string' 'extremely long string if you squint at it') collect: [:v| | r vs | vs := v size asFloat. (Time millisecondsToRun: [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: [:assoc :sym| | ss distance | ((ss := sym size) > 0 and: [vs / ss asFloat between: 0.5 and: 1.5]) ifTrue: [distance := sym distance: v. assoc key < distance ifTrue: [assoc] ifFalse: [distance -> sym]] ifFalse: [assoc]]]) -> r]" "#('zork' 'very long string' 'extremely long string if you squint at it') collect: [:v| | r s vs | vs := v size. (Time millisecondsToRun: [s := Symbol allGeneralInstances inject: 0 -> nil into: [:assoc :sym| | score | (score := sym spellAgainst: v) > assoc key ifTrue: [score -> sym] ifFalse: [assoc]]]) -> s -> (Time millisecondsToRun: [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: [:assoc :sym| | ss distance | ((ss := sym size) > 0 and: [ss between: vs - 3 and: vs + 3]) ifTrue: [distance := sym distance: v. assoc key < distance ifTrue: [assoc] ifFalse: [distance -> sym]] ifFalse: [assoc]]]) -> r]" "'zork' distance: 'fork'" "'Fork' distance: 'fork'"</body> </methods> </st-source> |
The problem with direct transcriptions of the algorithm is that they are
O(n^2) space, while it's not that hard to do it in O(n) space. You only even use the current and the previous rows of the matrix. Also, there's is no reason why in a language as nice as Smalltalk the operation could not be part of the generic SequenceableCollection protocol, allow any selector as comparison and allow to assign different costs to insertion, deletion and substitution. See 'Levenshtein Distance' in the public Store repository, for VW 7.x --Vassili Eliot Miranda wrote: > Here's a version of the code with a few minor optimizations and split > into a case-sensitive and case-insensitive pair... For VW 7.x > > Fernando wrote: > >> Hi, >> >> Before I go and reinvent the wheel, is there any implementation out >> there of the editing or Levenstein distance between 2 strings? >> >> Thanks >> > > > ------------------------------------------------------------------------ > > <?xml version="1.0"?> > > <st-source> > <time-stamp>From VisualWorks®, 7.3.1 of April 20, 2005 on July 22, 2005 at 10:38:28 am</time-stamp> > > > <methods> > <class-id>Core.CharacterArray</class-id> <category>comparing</category> > > <body package="Collections-Text" selector="exactCaseDistanceFrom:">exactCaseDistanceFrom: comparand > "Answer a Levenshtein distance between the receiver and some string. > Given two strings, self and comparand, and four operations, adding, > subtracting, exchanging or changing the case of single characters, > answer the minimal number of steps needed to translate self into comparand. > Algorithm description: http://www.merriampark.com/ld.htm. > Unlike distanceFrom: this version does not ignore case differences." > > | matrix size compSize | > size := self size. > compSize := comparand size. > matrix := Array new: size + 1. > 1 to: size + 1 do: [:i | | a | > a := Array new: compSize + 1. > a at: 1 put: i - 1. > matrix at: i put: a]. > 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i]. > 1 to: size do: [:i | > 1 to: compSize do: [:j | | x y z cs cc | > x := ((matrix at: i) at: j + 1) + 1. > y := ((matrix at: i + 1) at: j) + 1. > "z := (self at: i) = (comparand at: j) > ifTrue: [(matrix at: i) at: j] > ifFalse: [((matrix at: i) at: j) + 1]." > cs := self at: i. > cc := comparand at: j. > z := (matrix at: i) at: j. > cs ~~ cc ifTrue: > [z := z + (cs = cc > ifTrue: [0.25] > ifFalse: [1])]. > (matrix at: i + 1) > at: j + 1 put: ((x min: y) min: z)]]. > ^(matrix at: size + 1) > at: compSize + 1 > > > "#('zork' 'very long string' 'extremely long string if you squint at it') collect: > [:v| | r vs | > vs := v size asFloat. > (Time millisecondsToRun: > [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: > [:assoc :sym| | ss distance | > ((ss := sym size) > 0 > and: [vs / ss asFloat between: 0.5 and: 1.5]) > ifTrue: > [distance := sym distance: v. > assoc key < distance > ifTrue: [assoc] > ifFalse: [distance -> sym]] > ifFalse: [assoc]]]) -> r]" > > > "#('zork' 'very long string' 'extremely long string if you squint at it') collect: > [:v| | r s vs | > vs := v size. > (Time millisecondsToRun: > [s := Symbol allGeneralInstances inject: 0 -> nil into: > [:assoc :sym| | score | > (score := sym spellAgainst: v) > assoc key > ifTrue: > [score -> sym] > ifFalse: [assoc]]]) -> s -> > (Time millisecondsToRun: > [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: > [:assoc :sym| | ss distance | > ((ss := sym size) > 0 > and: [ss between: vs - 3 and: vs + 3]) > ifTrue: > [distance := sym distance: v. > assoc key < distance > ifTrue: [assoc] > ifFalse: [distance -> sym]] > ifFalse: [assoc]]]) -> r]" > > "'zork' distance: 'fork'" > "'Fork' distance: 'fork'"</body> > </methods> > > <methods> > <class-id>Core.CharacterArray</class-id> <category>comparing</category> > > <body package="Collections-Text" selector="distanceFrom:">distanceFrom: comparand > "Answer a Levenshtein distance between the receiver and some string. > Given two strings, self and comparand, and four operations, adding, > subtracting, exchanging or changing the case of single characters, > answer the minimal number of steps needed to translate self into comparand. > Algorithm description: http://www.merriampark.com/ld.htm" > > | matrix size compSize | > size := self size. > compSize := comparand size. > matrix := Array new: size + 1. > 1 to: size + 1 do: [:i | | a | > a := Array new: compSize + 1. > a at: 1 put: i - 1. > matrix at: i put: a]. > 0 to: compSize do: [:i | (matrix at: 1) at: i + 1 put: i]. > 1 to: size do: [:i | > 1 to: compSize do: [:j | | x y z cs cc | > x := ((matrix at: i) at: j + 1) + 1. > y := ((matrix at: i + 1) at: j) + 1. > "z := (self at: i) = (comparand at: j) > ifTrue: [(matrix at: i) at: j] > ifFalse: [((matrix at: i) at: j) + 1]." > cs := self at: i. > cc := comparand at: j. > z := (matrix at: i) at: j. > cs ~~ cc ifTrue: > [z := z + (cs asLowercase = cc asLowercase > ifTrue: [0.25] > ifFalse: [1])]. > (matrix at: i + 1) > at: j + 1 put: ((x min: y) min: z)]]. > ^(matrix at: size + 1) > at: compSize + 1 > > > "#('zork' 'very long string' 'extremely long string if you squint at it') collect: > [:v| | r vs | > vs := v size asFloat. > (Time millisecondsToRun: > [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: > [:assoc :sym| | ss distance | > ((ss := sym size) > 0 > and: [vs / ss asFloat between: 0.5 and: 1.5]) > ifTrue: > [distance := sym distance: v. > assoc key < distance > ifTrue: [assoc] > ifFalse: [distance -> sym]] > ifFalse: [assoc]]]) -> r]" > > > "#('zork' 'very long string' 'extremely long string if you squint at it') collect: > [:v| | r s vs | > vs := v size. > (Time millisecondsToRun: > [s := Symbol allGeneralInstances inject: 0 -> nil into: > [:assoc :sym| | score | > (score := sym spellAgainst: v) > assoc key > ifTrue: > [score -> sym] > ifFalse: [assoc]]]) -> s -> > (Time millisecondsToRun: > [r := Symbol allGeneralInstances inject: SmallInteger maxVal -> nil into: > [:assoc :sym| | ss distance | > ((ss := sym size) > 0 > and: [ss between: vs - 3 and: vs + 3]) > ifTrue: > [distance := sym distance: v. > assoc key < distance > ifTrue: [assoc] > ifFalse: [distance -> sym]] > ifFalse: [assoc]]]) -> r]" > > "'zork' distance: 'fork'" > "'Fork' distance: 'fork'"</body> > </methods> > > </st-source> -- Vassili Bykov VisualWorks Engineering, Tools Technical Lead v b y k o v A T c i n c o m D O T c o m http://www.cincomsmalltalk.com/userblogs/vbykov |
In reply to this post by Fernando Rodríguez
I just came across the work about computing a more general 'information
distance' by concatenating two strings and using standard compression algorithms to compress the result. The size of the compressed string is used to compute an information distance. The applications blew my mind. This is *domain independent* code that can apparently extract a lot of domain specific 'meaning', particularly when combined with Google. http://program.whatthehack.org/event/101.en.html http://complearn.org/ This is probably also interesting as a support tool for the people involved in Formal Concept Analysis (was that you Stef or Roel ?) R - |
Free forum by Nabble | Edit this page |