Hello, all,
Does anybody know the best way to make, get or port subj? Thanks in advance, Dmitry Zamotkin |
Hello, all
Here is code: ---------------- String >> distance: comparand "It calculates the following: Given two strings, a and b, and three operations, adding, subtracting and exchanging single characters, what is the minimal number of steps needed to translate a into b?" | matrix | matrix := LookupTable new. 0 to: self size do: [:i | matrix at: i@0 put: i]. 0 to: comparand size do: [:i | matrix at: 0@i put: i]. 1 to: self size do: [ :i | 1 to: comparand size do: [ :j | | x y z | x := ( matrix at: (i-1)@j ) + 1. y := ( matrix at: i@(j-1) ) + 1. (self at: i) = (comparand at: j) ifTrue: [ z := matrix at: (i-1)@(j-1) ] ifFalse: [ z := (matrix at: (i-1)@(j-1)) + 1 ]. matrix at: i@j put: ((x min: y) min: z). ] ]. ^ matrix at: self size @ comparand size ---------------- The method is based on the following idea: We want to find the distance between (self at: x) and (comparand at: y). To do this, we first calculate 1) the distance between (self at: x-1) and (comparand at: y), adding the cost of a subtract-operation, used to get from (self at: x) to (comparand at: z-1); 2) the distance between (self at: x) and (comparand at: y-1), adding the cost of an addition-operation, used to get from (comparand at: y-1) to (comparand at: y). 3) the distance between (self at: x-1) and (comparand at: y-1), adding the cost of a *possible* exchange of the letter (comparand at: y) (with (self at: x)). The cost of the subtraction and addition operations are 1, while the exchange operation has a cost of 1 if (self at: x) and (comparand at: y) are different, and 0 otherwise. After calculating these costs, we choose the least one of them (since we want to use the best solution.) Instead of doing this recursively (i.e. calculating ourselves "back" from the final value), we build a cost-matrix c containing the optimal costs, so we can reuse them when calculating the later values. The costs (matrix at: i@0) (from string of length n to empty string) are all i, and correspondingly all (matrix at: 0@j) (from empty string to string of length j) are j. Finally, the cost of translating between the full strings a and b (matrix at: self size@comparand size) is returned. Have a good one, Dmitry Zamotkin |
Dmitry,
Seeing your edit-distance calculation reminded me of one I've used before, by Miller & Myers published in Software Practice & Experience back in '85. I suspect that would be more efficient, in that it wouldn't have to build values at every point in the array (except in pathological cases). I tried to find a web-reachable version of the paper at: http://citeseer.nj.nec.com (a wonderful resource, if you don't know it already). The bad news is that I couldn't find that article there; the good news is that I did find several newer papers by the same authors and others -- as far as I can see the latest (and, presumably best) is: http://www.cs.arizona.edu/people/gene/PAPERS/np_diff.ps via Myers's CV at http://www.cs.arizona.edu/people/gene/vita.html I haven't actually read the paper yet, so I can't really comment on the algorithm. BTW, I'm glad you mentioned this stuff, because while I was looking I found a vein of papers that are directly relevant to a difficult open problem at the place I work as well. -- chris P.S. I do have C and Java versions of the algorithm from the original paper which I'd be quite happy to email, however I suspect that they'd be incomprehensible without the motivating text from the article. |
In reply to this post by Dmitry Zamotkin-2
Dmitry,
> Here is code: Oh, by the way, if you are willing to sacrifice clarity for speed, changing the LookupTable to an Array and indexing into it using arithmetic combination of the (i,j), then you can speed up that code by a factor of over 450 (at least for the test I used on this machine). I'll post, or email, code if anyone wants to see it, but -- as I said in an earlier reply -- there are significantly more efficient algorithms available, so it's probably better to start with them, and optimise their representation if necessary. -- chris |
"Chris Uppal" <[hidden email]> wrote in
message news:[hidden email]... > Oh, by the way, if you are willing to sacrifice clarity for speed, changing > the LookupTable to an Array and indexing into it using arithmetic > combination of the (i,j), then you can speed up that code by a factor of > over 450 (at least for the test I used on this machine). Yes, I came to same idea but with an array of arrays ( more picturesque, I guess :-) > > I'll post, or email, code if anyone wants to see it, but -- as I said in an > earlier reply -- there are significantly more efficient algorithms > available, so it's probably better to start with them, and optimise their > representation if necessary. Unfortunately, suggested document contains another algorithm and I was not able to find mentioned one. Please mail me C version of this algorithm so I'll could implement Smalltalk version. Thanks. > -- chris Dmitry Zamotkin inbox(at)writeme(dot)com |
In reply to this post by Chris Uppal-2
String >> distance: comparand
"It calculates the following: Given two strings, self and comparand, and three operations, adding, subtracting and exchanging single characters, what is the minimal number of steps needed to translate self into comparand?" | matrix size compSize | size := self size. compSize := comparand size. matrix := Array new: size + 1. 1 to: size+1 do: [ :i | matrix at: i put: (Array new: compSize +1) ]. 0 to: size do: [:i | (matrix at: i+1) at: 1 put: i]. 0 to: compSize do: [:i | (matrix at: 1) at: i+1 put: i]. 1 to: size do: [ :i | 1 to: compSize do: [ :j | | x y z | x := (( matrix at: i) at: j+1 ) + 1. y := (( matrix at: i+1) at: j ) + 1. (self at: i) = (comparand at: j) ifTrue: [ z := (matrix at: i) at: j ] ifFalse: [ z := ((matrix at: i) at: j) + 1 ]. (matrix at: i+1) at: j+1 put: ((x min: y) min: z). ] ]. ^ (matrix at: size+1 ) at: compSize+1 |
In reply to this post by Dmitry Zamotkin-2
Dmitry,
> Unfortunately, suggested document contains another algorithm and I was not > able to find mentioned one. I'm not sure what you mean. That paper does contain a minimum-edit-distance algorithm (unless I gave you the wrong reference!), it doesn't use quite the same metric for distance as you have been doing -- it only considers insertions and deletions and not "replacements". However I wouldn't normally think that was important -- there is no single "right" measure for distance. For instance should "movements" be considered ? Neither your algorithm nor the ones in the papers I mentioned do attempt to identify movements, but there are algorithms that do so (I don't have any references, though). Actually I think the metric they use gives better results than one which considers a replacement to have cost 1. Consider abcdefg efghijk and: tuvwxyz An algorithm which considers only insertions and deletions will consider 'abcdefg' to be 8 distant from 'efghijk', and 14 distant from 'tuvwxyz'. If you include replacements (and make them cost 1 rather than 2) then 'abcdefg' is equally distant (7) from both. I'll include a Smalltalk version of the algorithm at the end of this message. One smallish point is that I define it against SequenceableCollection, rather than String, since there's nothing about it that makes it special for Strings. One other point, the algorithm (like all M-E-D algorithms over an unknown set of "symbols") can go O(N^2) in the worst case when the two inputs are completely different. It'd possible to check for this easily enough (and it's be possible to give very efficient code for doing so for Strings and ByteArrays using bitmaps) but I haven't bothered here. Code follows, I hope the formatting isn't so badly screwed up as to be unreadable. -- chris =============================== SequenceableCollection>>editDistanceWMMMTo: aComparand "answer the edit distance between the reciever and the <SequenceableCollection>, aComparand. The edit distance is the minimum number of <<insert>> or <<delete>> operations needed to turn one sequence into another. This is an implementation of the algorithm described in: An O(NP) Sequence Comparison Algorithm Sun Wu, Udi Manber, Gene Myers & Webb Miller Information Processing Letters, vol. 35, pp. 317-323, 1990. which can be found online at: http://www.cs.arizona.edu/people/gene/PAPERS/np_diff.ps I have followed their presentation of the algorithm almost exactly. The main change I've made is to ensure that array N is larger than array M, which the paper assumes. The other big change I've had to make is to the array 'fp', in the paper pseudo-code this has range [-(m+1), (N+1)]; for the Smalltalk implementation I've had to introduce an offset of M+2 to accommodate Smalltalk's 1-based arrays. I've also introduced the variables M1, M2 and M3 to help pre-compute some of these indexes. Other differences are that their Greek 'DELTA' (the index of the target diagonal) is here called 'diag', and their Snake() subroutine is implemented as a block. I've changed the names of its local variables, k, x and y, to make it clear that there is no aliasing with the main body of this routine" | A B M N diag snake fp p M1 M2 M3 | #CUadded. self size <= aComparand size ifTrue: [A := self. B := aComparand] ifFalse: [A := aComparand. B := self]. M := A size. N := B size. diag := N - M. snake := [:kk :yy || xx | xx := yy - kk. [xx < M and: [yy < N and: [(A at: xx+1) = (B at: yy+1)]]] whileTrue: [xx := xx+1. yy := yy+1]. yy]. fp := Array new: N + M + 3. M1 := M+1. M2 := M+2. M3 := M+3. (-1-M) to: (N+1) do: [:i | fp at: i+M2 put: -1]. p := -1. [p := p + 1. (0-p) to: (diag-1) do: [:k | fp at: k+M2 put: (snake value: k value: (((fp at: k+M1) + 1) max: (fp at: k+M3)))]. (diag+p) to: (diag+1) by: -1 do: [:k | fp at: k+M2 put: (snake value: k value: (((fp at: k+M1) + 1) max: (fp at: k+M3)))]. fp at: diag+M2 put: (snake value: diag value: (((fp at: diag+M1) + 1) max: (fp at: diag+M3))). (fp at: diag+M2) = N] whileFalse. ^ diag + (2 * p). =============================== |
Chris,
"Chris Uppal" <[hidden email]> wrote in message news:9dtlse$8mi$[hidden email]... > Dmitry, > I'm not sure what you mean. That paper does contain a minimum-edit-distance > algorithm (unless I gave you the wrong reference!), it doesn't use quite the > same metric for distance as you have been doing -- it only considers > insertions and deletions and not "replacements". It was my fault :-( I was too AGILE > I'll include a Smalltalk version of the algorithm at the end of this > message. . Thank you very much! Dmitry |
Free forum by Nabble | Edit this page |