Data Cleansing Algorithms in Smalltalk

Data Cleansing Algorithms in Smalltalk

Günther Schmidt
Hi,

I need to import data from an Access file and eliminate duplicate records.

Does anybody know of "Data cleansing" algos in Smalltalk?

Günther


Re: Data Cleansing Algorithms in Smalltalk

Udo Schneider
Günther Schmidt wrote:
> I need to import data from an Access file and eliminate duplicate records.
>
> Does anybody know of "Data cleansing" algos in Smalltalk?
Are the records identical or just similar? In the second case I would
recommend combining an algorithm that takes into account how a word
sounds (e.g. (Double-)Metaphone or Soundex) with one that measures how
it is written (e.g. Levenshtein distance).
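
To make the suggestion concrete, here is a minimal sketch of one way to combine the two ideas. It is written in Python rather than Smalltalk purely for brevity. The function names (`soundex`, `levenshtein`, `probably_same`) and the `max_distance` threshold are illustrative assumptions, and requiring both tests to pass is only one possible way to mix them.

```python
# Letter groups of classic (American) Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code; non A-Z characters are dropped."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate equal codes; vowels do.
        if c not in "HW":
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def probably_same(a, b, max_distance=2):
    """Assumed rule: same sound code AND small spelling difference."""
    return (soundex(a) == soundex(b)
            and levenshtein(a.lower(), b.lower()) <= max_distance)
```

With this rule, `probably_same("Schmidt", "Schmitt")` holds, while `probably_same("Schmidt", "Schneider")` does not because the Soundex codes differ. For messy data the threshold would need tuning against real records.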

Implementations of the Soundex and NYSIIS algorithms in Smalltalk can be
found here: http://www.nls.net/mp/jarvis/Bob/DolphinGoodies.htm

I should also have implementations of Levenshtein distance and
(Double-)Metaphone somewhere in my archive.

If you are interested I could dig them out.

Otherwise you might find these things interesting:
http://en.wikipedia.org/wiki/Levenshtein
http://en.wikipedia.org/wiki/Metaphone
http://en.wikipedia.org/wiki/Double_Metaphone
http://en.wikipedia.org/wiki/Soundex
http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
http://www.codeproject.com/string/dmetaphone6.asp

Regards,

Udo


Re: Data Cleansing Algorithms in Smalltalk

Günther Schmidt
Udo,

thanks.

I was already aware of the possible solutions.

The database records are unfortunately pretty messy, so I'm looking for
a Smalltalk implementation of

"An Efficient Domain-Independent Algorithm for Detecting Approximately
Duplicate Database Records"

by Alvaro Monge and Charles Elkan.
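
For readers following along: the sketch below is not the algorithm from that paper. It only illustrates the general shape of approximate duplicate detection: sort the records, compare each one with a few sorted neighbours, and merge matches transitively with a union-find structure. It is in Python for brevity. `cluster_duplicates`, `window`, and `threshold` are hypothetical names and values, and `difflib.SequenceMatcher` is merely a stand-in similarity measure.

```python
from difflib import SequenceMatcher


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_duplicates(records, window=3, threshold=0.85):
    """Group strings that look like duplicates of each other.

    Each record is compared only with the `window` records that precede
    it in sorted order; matches are merged so that "duplicate of" is
    treated as transitive.
    """
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    parent = list(range(len(records)))
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window):pos]:
            ratio = SequenceMatcher(None, records[i].lower(),
                                    records[j].lower()).ratio()
            if ratio >= threshold:
                parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(records[i])
    return list(clusters.values())
```

Called on `["Guenther Schmidt", "Udo Schneider", "Guenther Schmitt", "udo schneider"]`, this groups the two Schmidt spellings together and the two Schneider spellings together. A single sort key will miss duplicates whose differences fall at the start of the string, so real data would likely need several passes with different keys.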


I'll do some more research on this and then try to roll my own.

Günther

