Fuzzy string comparison

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Fuzzy string comparison

Dmitry Zamotkin-2
Hello, all,

Does anybody know the best way to make, get or port subj?

Thanks in advance,
Dmitry Zamotkin


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Dmitry Zamotkin-2
Hello, all

Here is code:

----------------
String >> distance: comparand
 "It calculates the following: Given two strings, a and b, and three
operations, adding, subtracting and exchanging single characters, what is
the minimal number of steps needed to translate a into b?"
 | matrix |

 matrix := LookupTable new.

 0 to: self size do: [:i | matrix at: i@0 put: i].
 0 to: comparand size do: [:i | matrix at: 0@i put: i].

 1 to: self size do: [ :i |
  1 to: comparand size do: [ :j | | x y z |
   x := ( matrix at: (i-1)@j ) + 1.
   y := ( matrix at: i@(j-1) ) + 1.

   (self at: i) = (comparand at: j)
    ifTrue: [ z := matrix at: (i-1)@(j-1) ]
    ifFalse: [ z := (matrix at: (i-1)@(j-1)) + 1 ].
   matrix at: i@j put: ((x min: y) min: z).
  ]
 ].

 ^ matrix at: self size @ comparand size
----------------

The method is based on the following idea:

  We want to find the distance between (self at: x) and (comparand at: y).
To do this, we  first calculate

  1) the distance between (self at: x-1) and (comparand at: y), adding the
cost of a subtract-operation, used to get from (self at: x) to (comparand
at: z-1);

  2) the distance between (self at: x) and (comparand at: y-1), adding the
cost of an addition-operation, used to get from (comparand at: y-1) to
(comparand at: y).

  3) the distance between (self at: x-1) and (comparand at: y-1), adding the
cost of a *possible* exchange of the letter (comparand at: y) (with (self
at: x)).

The cost of the subtraction and addition operations are 1, while the
exchange operation has a cost of 1 if (self at: x) and (comparand at: y) are
different, and 0 otherwise.

After calculating these costs, we choose the least one of them (since we
want to use the best solution.)

Instead of doing this recursively (i.e. calculating ourselves "back" from
the final value), we build a cost-matrix c containing the optimal
costs, so we can reuse them when calculating the later values. The costs
(matrix at: i@0) (from string of length n to empty string) are all i,
and correspondingly all (matrix at: 0@j) (from empty string to string of
length j) are j.

Finally, the cost of translating between the full strings a and b (matrix
at: self size@comparand size) is returned.

Have a good one,
Dmitry Zamotkin


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Chris Uppal-3
Dmitry,

Seeing your edit-distance calculation reminded me of one I've used
before, by Miller & Myers published in Software Practice & Experience
back in '85.  I suspect that would be more efficient, in that it
wouldn't have to build values at every point in the array (except in
pathological cases).  I tried to find a web-reachable version of the
paper at: http://citeseer.nj.nec.com  (a wonderful resource, if you
don't know it already).

The bad news is that I couldn't find that article there; the good news
is that I did find several newer papers by the same authors and
others -- as far as I can see the latest (and, presumably best) is:

    http://www.cs.arizona.edu/people/gene/PAPERS/np_diff.ps

via Myers's CV at http://www.cs.arizona.edu/people/gene/vita.html

I haven't actually read the paper yet, so I can't really comment on the
algorithm.

BTW, I'm glad you mentioned this stuff, because while I was looking I
found a vein of papers that are directly relevant to a difficult open
problem at the place I work as well.

    -- chris

P.S. I do have C and Java versions of the algorithm from the original
paper which I'd be quite happy to email, however I suspect that they'd
be incomprehensible without the motivating text from the article.


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Chris Uppal-2
In reply to this post by Dmitry Zamotkin-2
Dmitry,

> Here is code:

Oh, by the way, if you are willing to sacrifice clarity for speed, changing
the LookupTable to an Array and indexing into it using arithmetic
combination of the (i,j), then you can speed up that code by a factor of
over 450 (at least for the test I used on this machine).

I'll post, or email, code if anyone wants to see it, but -- as I said in an
earlier reply -- there are significantly more efficient algorithms
available, so it's probably better to start with them, and optimise their
representation if necessary.

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Dmitry Zamotkin-2
"Chris Uppal" <[hidden email]> wrote in
message news:[hidden email]...
> Oh, by the way, if you are willing to sacrifice clarity for speed,
changing
> the LookupTable to an Array and indexing into it using arithmetic
> combination of the (i,j), then you can speed up that code by a factor of
> over 450 (at least for the test I used on this machine).

Yes, I came to same idea but with an array of arrays ( more picturesque, I
guess :-)

>
> I'll post, or email, code if anyone wants to see it, but -- as I said in
an
> earlier reply -- there are significantly more efficient algorithms
> available, so it's probably better to start with them, and optimise their
> representation if necessary.

Unfortunately, suggested document contains another algorithm and I was not
able to find mentioned one. Please mail me C version of this algorithm so
I'll could implement Smalltalk version. Thanks.

>     -- chris

Dmitry Zamotkin
inbox(at)writeme(dot)com


Reply | Threaded
Open this post in threaded view
|

Re: Improved fuzzy string comparison

Dmitry Zamotkin-2
In reply to this post by Chris Uppal-2
String >> distance: comparand
 "It calculates the following: Given two strings, self and comparand, and
three operations, adding, subtracting and exchanging single characters, what
is the minimal number of steps needed to translate self into comparand?"
 | matrix size compSize |

 size := self size.
 compSize := comparand size.

 matrix := Array new: size + 1.
 1 to: size+1 do: [ :i | matrix at: i put: (Array new: compSize +1) ].

 0 to: size do: [:i | (matrix at: i+1) at: 1 put: i].
 0 to: compSize do: [:i | (matrix at: 1) at: i+1 put: i].

 1 to: size do: [ :i |
  1 to: compSize do: [ :j | | x y z |
   x := (( matrix at: i) at: j+1 ) + 1.
   y := (( matrix at: i+1) at: j ) + 1.

   (self at: i) = (comparand at: j)
    ifTrue: [ z := (matrix at: i) at: j ]
    ifFalse: [ z := ((matrix at: i) at: j) + 1 ].
   (matrix at: i+1) at: j+1 put: ((x min: y) min: z).
  ]
 ].

 ^ (matrix at: size+1 ) at: compSize+1


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Chris Uppal-3
In reply to this post by Dmitry Zamotkin-2
Dmitry,

> Unfortunately, suggested document contains another algorithm and I was not
> able to find mentioned one.

I'm not sure what you mean.  That paper does contain a minimum-edit-distance
algorithm (unless I gave you the wrong reference!), it doesn't use quite the
same metric for distance as you have been doing -- it only considers
insertions and deletions and not "replacements".  However I wouldn't
normally think that was important -- there is no single "right" measure for
distance.  For instance should "movements" be considered ?  Neither your
algorithm nor the ones in the papers I mentioned do attempt to identify
movements, but there are algorithms that do so (I don't have any references,
though).

Actually I think the metric they use gives better results than one which
considers a replacement to have cost 1.  Consider

    abcdefg
    efghijk
and:
    tuvwxyz

An algorithm which considers only insertions and deletions will consider
'abcdefg' to be 8 distant from 'efghijk', and 14 distant from 'tuvwxyz'.  If
you include replacements (and make them cost 1 rather than 2) then 'abcdefg'
is equally distant (7) from both.

I'll include a Smalltalk version of the algorithm at the end of this
message.  One smallish point is that I define it against
SequenceableCollection, rather than String, since there's nothing about it
that makes it special for Strings.

One other point, the algorithm (like all M-E-D algorithms over an unknown
set of "symbols") can go O(N^2) in the worst case when the two inputs are
completely different.  It'd possible to check for this easily enough (and
it's be possible to give very efficient code for doing so for Strings and
ByteArrays using bitmaps) but I haven't bothered here.

Code follows, I hope the formatting isn't so badly screwed up as to be unreadable.

    -- chris

===============================
SequenceableCollection>>editDistanceWMMMTo: aComparand
 "answer the edit distance between the reciever and the <SequenceableCollection>, aComparand.
 The edit distance is the minimum number of <<insert>> or <<delete>> operations needed to turn
 one sequence into another.

 This is an implementation of the algorithm described in:

  An O(NP) Sequence Comparison Algorithm
  Sun Wu, Udi Manber, Gene Myers & Webb Miller
  Information Processing Letters, vol. 35, pp. 317-323, 1990.

 which can be found online at:

  http://www.cs.arizona.edu/people/gene/PAPERS/np_diff.ps

 I have followed their presentation of the algorithm almost exactly.

 The main change I've made is to ensure that array N is larger than array M, which the paper assumes.

 The other big change I've had to make is to the array 'fp', in the paper pseudo-code this has range
 [-(m+1), (N+1)]; for the Smalltalk implementation I've had to introduce an offset of M+2 to accommodate
 Smalltalk's 1-based arrays.  I've also introduced the variables M1, M2 and M3 to help pre-compute some
 of these indexes.

 Other differences are that their Greek 'DELTA' (the index of the target diagonal) is here called 'diag', and
 their Snake() subroutine is implemented as a block.  I've changed the names of its local variables, k, x and y,
 to make it clear that there is no aliasing with the main body of this routine"

 | A B M N diag snake fp p M1 M2 M3 |
#CUadded.

 self size <= aComparand size
  ifTrue:
   [A := self.
   B := aComparand]
  ifFalse:
   [A := aComparand.
   B := self].

 M := A size.
 N := B size.
 diag := N - M.
 snake :=
  [:kk :yy || xx |
  xx := yy - kk.
  [xx < M and: [yy < N and: [(A at: xx+1) = (B at: yy+1)]]]
   whileTrue: [xx := xx+1. yy := yy+1].
  yy].

 fp := Array new: N + M + 3.
 M1 := M+1.
 M2 := M+2.
 M3 := M+3.
 (-1-M) to: (N+1) do: [:i | fp at: i+M2 put: -1].

 p := -1.
 [p := p + 1.
 (0-p) to: (diag-1) do:
  [:k | fp at: k+M2 put: (snake value: k value: (((fp at: k+M1) + 1) max:
(fp at: k+M3)))].
 (diag+p) to: (diag+1) by: -1
  do: [:k | fp at: k+M2 put: (snake value: k value: (((fp at: k+M1) + 1)
max: (fp at: k+M3)))].
 fp at: diag+M2 put: (snake value: diag value: (((fp at: diag+M1) + 1) max:
(fp at: diag+M3))).
 (fp at: diag+M2) = N]
  whileFalse.

 ^ diag + (2 * p).
===============================


Reply | Threaded
Open this post in threaded view
|

Re: Fuzzy string comparison

Dmitry Zamotkin-2
Chris,

"Chris Uppal" <[hidden email]> wrote in message
news:9dtlse$8mi$[hidden email]...
> Dmitry,

> I'm not sure what you mean.  That paper does contain a
minimum-edit-distance
> algorithm (unless I gave you the wrong reference!), it doesn't use quite
the
> same metric for distance as you have been doing -- it only considers
> insertions and deletions and not "replacements".

It was my fault :-(
I was too AGILE

> I'll include a Smalltalk version of the algorithm at the end of this
> message.  .

Thank you very much!

Dmitry