Data Cleansing Algorithms in Smalltalk


Data Cleansing Algorithms in Smalltalk

Günther Schmidt
Hi,

I need to import data from an Access file and eliminate duplicate records.

Does anybody know of "Data cleansing" algos in Smalltalk?

Günther


Re: Data Cleansing Algorithms in Smalltalk

Reinout Heeck

On Jun 17, 2006, at 11:53 AM, Günther Schmidt wrote:

> Hi,
>
> I need to import data from an Access file and eliminate duplicate
> records.
>
> Does anybody know of "Data cleansing" algos in Smalltalk?
>
> Günther
>
>


The simple answer is to put your records in a Set.

If you have a Record class for your data and implement #= and #hash  
then all should 'just work'.

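"MyRecordStream is a placeholder for whatever class reads your Access export."
records := (MyRecordStream onAccessFile: 'foo.dat') contents.
"A Set keeps one element per group of equal (#=) records; it does not preserve the original order."
uniqueRecords := records asSet asOrderedCollection.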
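
A minimal sketch of what implementing #= and #hash could look like, assuming a hypothetical MyRecord class with instance variables name and city and matching accessors (none of these names come from the thread). The Class>>selector headers are only the usual display convention; each method would be entered in the browser.

```smalltalk
"Hypothetical class MyRecord with instance variables name and city
 and accessors #name and #city; adapt this to your own fields."

MyRecord>>= anObject
    "Two records are equal when they have the same class and all fields match."
    ^self class == anObject class
        and: [name = anObject name and: [city = anObject city]]

MyRecord>>hash
    "Equal records must answer the same hash, so combine the hashes
     of exactly those fields that #= compares."
    ^name hash bitXor: city hash
```

With both methods in place, a Set treats two MyRecord instances with the same name and city as a single element.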




HTH,

Reinout
-------







Re: Data Cleansing Algorithms in Smalltalk

Hernán Morales
In reply to this post by Günther Schmidt
Well, I have a simplified VisualSmalltalk implementation
of two approximate string matching algorithms, based on
Baeza-Yates, R. A. and Gonnet, G. H., "A new approach to
text searching", and Wu, S. and Manber, U., "Fast text
searching allowing errors" (search ACM.org for details;
I don't have the links at hand right now).
I think it could be useful for data cleansing, and it
should be easy to port to VisualWorks.
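
For readers unfamiliar with these papers, here is a rough workspace sketch of the bit-parallel "shift-and" idea behind the Baeza-Yates and Gonnet approach, in its simplest form (exact matching only). It is not the implementation mentioned in this post; the variable names and the sample data are made up for illustration.

```smalltalk
| pattern text masks accept state matches |
pattern := 'Schmidt'.
text := 'Guenther Schmidt, Gunther Schmitt, G. Schmidt'.

"For each character, build an integer whose bit (i - 1) is set
 when that character occurs at position i of the pattern."
masks := Dictionary new.
1 to: pattern size do: [:i | | ch |
    ch := pattern at: i.
    masks at: ch put: ((masks at: ch ifAbsent: [0])
        bitOr: (1 bitShift: i - 1))].

"Bit (j - 1) of state is set when the first j pattern characters
 match the text ending at the current position."
accept := 1 bitShift: pattern size - 1.
state := 0.
matches := OrderedCollection new.
1 to: text size do: [:i |
    state := ((state bitShift: 1) bitOr: 1)
        bitAnd: (masks at: (text at: i) ifAbsent: [0]).
    (state bitAnd: accept) = 0
        ifFalse: [matches add: i - pattern size + 1]].

"matches now holds the start index of every exact occurrence
 (10 and 39 for the sample text above)."
matches
```

Smalltalk integers grow as needed, so the pattern length is not limited by a machine word here. The Wu and Manber paper extends this scheme to allow up to k errors by keeping k + 1 such state values; that part is not shown.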

I can't post the link because I'm still finishing my
website; however, you can send me an e-mail if you're
interested.

Hernán






Re: Data Cleansing Algorithms in Smalltalk

Reinout Heeck-2
Hernán Morales wrote:

> Well, I have a simplified implementation for
> VisualSmalltalk of two ambiguous matching algorithm
> based on Baeza-Yates, R. A. and Gonnet, G.H., "A new
> approach to text searching", and Wu, S., and Manber,
> U., "Fast text searching allowing errors" (search in
> the ACM.org for details, I don't have the links at
> hand right now).
> I think maybe it could be useful for data cleansing,
> and it should be easy to port to VisualWorks.
>
> I can't post the link because I'm finishing my
> website, however you can send me an e-mail if you're
> interested.


Please consider publishing it into the Cincom open Store repository,
that way others can easily use it and also publish back any enhancements.

I know WikiWorks could use a better diff algorithm ;-)



R
-