[vwnc] Checking for duplicate elements in a collection


Ron Dobbelstein
Hi,
 
I have an OrderedCollection containing about 600,000 elements. It is possible there are duplicate elements. I would like to collect all the duplicate elements. My simple solution was: copy every element to a Set. This almost works: duplicates are eliminated, but I don't get an error or exception that I can catch in order to intercept the duplicates. Any ideas how to solve this?
 
TIA,
 
Ron
 
The information contained in this communication is confidential and may be legally privileged. It is intended solely for the use of the individual or entity to whom it is addressed and others authorised to receive it. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. Sender is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt. Please note that the confidentiality of e-mail communication is not warranted.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] Checking for duplicate elements in a collection

Andres Valloud-4
With some performance penalty, you can do:

| withDuplicates unique duplicates |
withDuplicates := someOrderedCollection.
unique := Set new.
duplicates := OrderedCollection new.
withDuplicates do:
   [:each |
     (unique includes: each)
       ifTrue: [duplicates add: each]
       ifFalse: [unique add: each]
   ].
"do something with the results"

If you need this to run faster, you might want to try:

| withDuplicates unique duplicates |
withDuplicates := someOrderedCollection.
unique := Set new: withDuplicates size.
duplicates := OrderedCollection new.
withDuplicates do:
   [:each |
     (unique answerWhetherAdded: each)
       ifFalse: [duplicates add: each]
   ].
"do something with the results"

The answerWhetherAdded: method in Set (which you'd have to implement
yourself) would try to add each element and answer whether the addition
succeeded. If it succeeds, the element was not yet in the set; if it
fails, the element is a duplicate.
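
One possible way to write such a method, as a minimal sketch only: it does not touch the internals of Set (which differ between dialects and versions) and simply compares the size before and after the add. The selector answerWhetherAdded: is the hypothetical extension named above, not an existing method. A version that probes the hash table directly would avoid the second size check, but that depends on private methods not shown in this thread.

```smalltalk
"Hypothetical extension method, to be added to class Set."
answerWhetherAdded: anObject
    "Add anObject and answer true if the set grew, false if an equal
     element was already present. Uses only the public protocol."
    | oldSize |
    oldSize := self size.
    self add: anObject.
    ^self size > oldSize
```

With that in place, (Set new answerWhetherAdded: 3) answers true, and sending the same message again to the same set with 3 answers false.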

Andres.



Re: [vwnc] Checking for duplicate elements in a collection

Travis Griggs-3
In reply to this post by Ron Dobbelstein

On Dec 3, 2009, at 6:42 AM, Ron Dobbelstein wrote:

> Hi,
>
> I have an OrderedCollection containing about 600,000 elements. It is
> possible there are duplicate elements. I would like to collect all  
> the duplicate elements. My simple solution was: copy every element  
> to a Set. This almost works: duplicates are eliminated, but I don't  
> get an error or exception that I can catch in order to intercept the  
> duplicates. Any ideas how to solve this?

Would this work?

uniqueElements := Set new.
myBigCollection do:
    [:each |
    (uniqueElements includes: each)
        ifTrue: [self doSomethingAboutTheDuplicateElement]
        ifFalse: [uniqueElements add: each]]

--
Travis Griggs
Objologist
I patented thinking... and I still can't find anyone infringing.


Re: [vwnc] Checking for duplicate elements in a collection

Alan Knight-2
In reply to this post by Ron Dobbelstein
Sort them on the complete list of attributes, or at least enough that duplicates should end up next to each other, then find elements where the next element is a duplicate.
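
A minimal sketch of the sort-and-compare idea, assuming the elements understand <= so that the default sort order can be used. For real domain objects you would pass your own sort block over the relevant attributes to asSortedCollection: instead. The sample data is invented for illustration.

```smalltalk
| someOrderedCollection sorted duplicates |
"Sample data; the elements must understand <= for the default sort order."
someOrderedCollection := OrderedCollection withAll: #(3 1 2 3 1 3).
"After sorting, equal elements sit next to each other."
sorted := someOrderedCollection asSortedCollection asArray.
duplicates := OrderedCollection new.
2 to: sorted size do:
    [:index |
    (sorted at: index) = (sorted at: index - 1)
        ifTrue: [duplicates add: (sorted at: index)]].
"duplicates now holds every occurrence after the first: 1, 3 and 3"
```

This costs a sort instead of hashing, so it is mainly attractive when the elements have a natural ordering but no cheap or well-distributed hash.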


--
Alan Knight [|], Engineering Manager, Cincom Smalltalk


Re: [vwnc] Checking for duplicate elements in a collection

Winemiller, Chris

How about using a Bag?

myCollection asBag valuesAndCountsDo:
    [:anObject :count |
    count > 1 ifTrue: ["Do something about the duplicates here"]]


Chris



Re: [vwnc] Checking for duplicate elements in a collection

Thomas Brodt
In reply to this post by Ron Dobbelstein
How about #asBag, which counts the occurrences of each element, and
#valuesAndCountsDo: to enumerate the elements together with their counts?

Thomas



Re: [vwnc] Checking for duplicate elements in a collection

Henrik Sperre Johansen
In reply to this post by Ron Dobbelstein
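How about

| aBag collectionWithoutDuplicates |
collectionWithoutDuplicates := OrderedCollection new.
aBag := Bag withAll: collectionWithDuplicates.
aBag valuesAndCountsDo: [:element :count |
    "handle:duplicatesOf: stands for whatever you want to do with the duplicates"
    count > 1 ifTrue: [self handle: count - 1 duplicatesOf: element].
    collectionWithoutDuplicates addLast: element]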


Cheers,
Henry


Re: [vwnc] Checking for duplicate elements in a collection

Holger Guhl
In reply to this post by Ron Dobbelstein
Hi Ron,
others have already posted the essential solution. Here is my silly
trick to avoid the performance penalty of the #includes: test:

| withDuplicates unique duplicates oldSize |
withDuplicates := someOrderedCollection.
unique := Set new: withDuplicates size * 3 // 2.
duplicates := OrderedCollection new.
oldSize := 0.
withDuplicates do:
   [:each | | newSize |
   unique add: each.
   (newSize := unique size) = oldSize
       ifTrue: [duplicates add: each]
       ifFalse: [oldSize := newSize]
   ].


Andres has already hinted at presizing the unique set for good
performance. But Set>>new: does not allocate the extra "unused" space
that hashed collections need to perform well. Assuming that duplicates
are the exception, I propose reserving capacity for all elements of the
collection being checked, plus some headroom (hence the * 3 // 2 above).
A single #grow (especially when the capacity is only just exceeded) can
spoil the entire performance.
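
Whether presizing pays off depends on the image and the data, so it is worth measuring. A rough sketch using Time millisecondsToRun:, with invented sample data (integers, not the original poster's objects) and no claim about which variant wins:

```smalltalk
| data plain presized timePlain timePresized |
"Sample data: 600,000 distinct integers plus 1,000 repeats."
data := (1 to: 600000) asOrderedCollection.
data addAll: (1 to: 1000).
"Variant 1: default-sized Set, which has to grow as it fills."
timePlain := Time millisecondsToRun:
    [plain := Set new.
    data do: [:each | plain add: each]].
"Variant 2: Set presized with headroom, as proposed above."
timePresized := Time millisecondsToRun:
    [presized := Set new: data size * 3 // 2.
    data do: [:each | presized add: each]].
Transcript
    show: 'default Set: ', timePlain printString, ' ms'; cr;
    show: 'presized Set: ', timePresized printString, ' ms'; cr
```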
Cheers

Holger Guhl
--
Senior Consultant * Certified Scrum Master * [hidden email]
Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
Georg Heeg eK Dortmund
Handelsregister: Amtsgericht Dortmund  A 12812

