Data Cleansing Algorithms in Smalltalk

Günther Schmidt

Data Cleansing Algorithms in Smalltalk

Hi,

I need to import data from an Access file and eliminate duplicate records.

Does anybody know of "Data cleansing" algos in Smalltalk?

Günther

Udo Schneider

Re: Data Cleansing Algorithms in Smalltalk

Günther Schmidt wrote:
> I need to import data from an Access file and eliminate duplicate records.
>
> Does anybody know of "Data cleansing" algos in Smalltalk?
Are the records identical or just similar? In the second case I would
recommend combining an algorithm that takes into account how a word
sounds (e.g. (Double-)Metaphone or Soundex) with one that looks at how
it is written (e.g. Levenshtein distance).
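
A minimal sketch of such a mixture, written in Python for brevity rather than Smalltalk and meant to be ported. It pairs American Soundex (how a word sounds) with Levenshtein distance (how it is written). The function names and the `max_distance` threshold are illustrative assumptions, not taken from any of the implementations mentioned in this thread.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
_SOUNDEX_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # h and w do not separate two letters that share a digit; vowels do.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Return the edit distance (insert, delete, substitute) between a and b."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,               # delete ca
                           row[j - 1] + 1,                # insert cb
                           prev_row[j - 1] + (ca != cb)))  # substitute
        prev_row = row
    return prev_row[-1]


def probably_same(a, b, max_distance=2):
    """Assumed rule: same Soundex code and a small edit distance."""
    return (soundex(a) == soundex(b)
            and levenshtein(a.lower(), b.lower()) <= max_distance)
```

With this rule, `probably_same("Schmidt", "Schmitt")` holds, while "Schmidt" and "Schneider" are already rejected by their differing Soundex codes. The threshold would need tuning against the real data.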

Implementations of the Soundex and NYSIIS algorithms in Smalltalk can be
found here: http://www.nls.net/mp/jarvis/Bob/DolphinGoodies.htm

I should also have implementations of Levenshtein distance and
(Double-)Metaphone somewhere in my archive.

If you are interested I could dig them out.

Otherwise you might find these things interesting:
http://en.wikipedia.org/wiki/Levenshtein
http://en.wikipedia.org/wiki/Metaphone
http://en.wikipedia.org/wiki/Double_Metaphone
http://en.wikipedia.org/wiki/Soundex
http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
http://www.codeproject.com/string/dmetaphone6.asp

Regards,

Udo

Günther Schmidt

Re: Data Cleansing Algorithms in Smalltalk

Udo,

thanks.

I was already aware of the possible solutions.

The database records are unfortunately pretty messy, so I'm looking for
a Smalltalk implementation of

"An Efficient Domain-Independent Algorithm for Detecting Approximately
Duplicate Database Records"

by Alvaro Monge and Charles Elkan.

I'll do some more research on this and then try to roll my own.
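
For orientation, here is a rough sketch of one common shape for an approximate-duplicate detector: sort the records, compare each one only with a few sorted neighbours, and collect matches into clusters with a union-find structure. This is not the Monge/Elkan algorithm itself, only a simplified illustration in Python (not Smalltalk). The names `cluster_duplicates`, `similar`, `window` and `threshold` are assumptions, and `difflib.SequenceMatcher` stands in for whatever record matcher would really be used.

```python
from difflib import SequenceMatcher


class UnionFind:
    """Disjoint sets over the indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def similar(a, b, threshold=0.85):
    """Stand-in matcher: similarity ratio of the two lower-cased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_duplicates(records, window=3):
    """Group records that look like approximate duplicates of each other."""
    # Sorting brings likely duplicates close together.
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    uf = UnionFind(len(records))
    for pos, i in enumerate(order):
        # Compare only with the previous `window` records in sorted order.
        for j in order[max(0, pos - window):pos]:
            # Skip the comparison if both are already in the same cluster.
            if uf.find(i) != uf.find(j) and similar(records[i], records[j]):
                uf.union(i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(records[i])
    return list(clusters.values())
```

A single sort key misses duplicates whose first characters differ, so a real implementation would probably make several passes with different keys and keep merging into the same union-find structure.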

Günther
