I lack the time to parse this in relation to VW, so I throw out some questions to more knowledgeable people here.

"Efficient Denial of Service Attacks on Web Application Platforms"
http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf

The short version: if you hand a POST request to a web server it will typically keep the names of the POST variables in a hashed collection. If you know the hashing algorithm you can craft these names such that all of them have the same hash. Put several million onto a single POST request and the web server will churn for minutes on a single request. This DoS attack can also target other hashed collections, such as element names in a parsed XML document.

Question 1: is the VW hashing algorithm for strings structured such that it is cheap to generate millions of strings with the same hash value?

Question 2: can we generically counter such an attack? The paper proposes random hash seeding, which seems inappropriate for image-based development. Should we consider using a new hash function that is (cryptographically) one-way?

R
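To make the attack concrete, here is a toy illustration in Smalltalk. It uses a deliberately weak multiplicative hash with factor 33 (the kind of function the paper attacks), not VW's actual String>>hash, and shows how 2^n colliding keys can be enumerated from two colliding two-character blocks:

    "Toy multiplicative hash with factor 33 -- NOT VisualWorks' real String>>hash."
    | toyHash blocks keys |
    toyHash := [:s | s inject: 0 into: [:h :c | h * 33 + c asInteger]].
    "'Ez' and 'FY' hash identically under factor 33:
     $E asInteger * 33 + $z asInteger = 2399 = $F asInteger * 33 + $Y asInteger."
    blocks := #('Ez' 'FY').
    "Every concatenation of n such blocks collides, giving 2^n equal-hash keys."
    keys := OrderedCollection with: ''.
    5 timesRepeat:
        [keys := keys inject: OrderedCollection new into:
            [:acc :prefix | blocks do: [:b | acc add: prefix , b]. acc]].
    (keys collect: toyHash) asSet size    "1 -- all 32 keys share the same hash"

An attacker does the same thing against the server's real hash function and then puts the resulting names into one POST body.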
On Wed, Jan 4, 2012 at 7:59 AM, Reinout Heeck <[hidden email]> wrote:
Can you not determine an upper limit for the number of variables in a valid POST and reject any that contain more variables than the limit?
best, Eliot
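Eliot's suggestion can be sketched as a pre-parse guard. The names here (rawBody, maxFields, the error message) are made up and this is not VisualWave or Seaside API; the point is simply to count the '&'-separated pairs in the raw application/x-www-form-urlencoded body before anything is put into a hashed collection:

    | rawBody maxFields fieldCount |
    rawBody := 'a=1&b=2&c=3'.    "stand-in for the undecoded POST body"
    maxFields := 1000.
    fieldCount := 1 + (rawBody occurrencesOf: $&).
    fieldCount > maxFields
        ifTrue: [self error: 'POST rejected: too many form fields'].
    "only now hand rawBody to the normal form decoder"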
In reply to this post by Reinout Heeck-2
Well that's clever.
Without looking at it in too much detail, the VW algorithm is not that
complicated, so I imagine it would be possible to engineer many strings
with the same hash. And post values are kept in dictionaries both in
Seaside and in WebToolkit/VisualWave, so I suspect that attack would
work. The server will, at least in the WebToolkit case, attempt to kill
requests if the server is being bogged down, but if there are enough of
them thrashing it, that probably wouldn't help. The obvious generic counter would be to use a sorted dictionary that uses binary search rather than a hash table. But since the attack relies on sending millions of strings, and the strings are probably not short, filtering out all POST requests that have, say, more than a thousand parameters might be a generic counter. It would be hard to filter on request size, because you might have one POST parameter that is many megabytes as a file upload.
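A minimal sketch of the sorted-dictionary idea mentioned above: keep the name/value pairs as a sorted Array of Associations and look keys up by binary search, so the worst case is O(log n) comparisons regardless of hash behaviour. Sample data only; in practice the pairs would come from the parsed request:

    | pairs lookup |
    pairs := ((Array with: 'beta' -> 2 with: 'alpha' -> 1 with: 'gamma' -> 3)
        asSortedCollection: [:a :b | a key <= b key]) asArray.
    lookup := [:k | | lo hi mid found |
        lo := 1. hi := pairs size. found := nil.
        [found isNil and: [lo <= hi]] whileTrue:
            [mid := (lo + hi) // 2.
             (pairs at: mid) key = k
                ifTrue: [found := (pairs at: mid) value]
                ifFalse: [(pairs at: mid) key < k
                    ifTrue: [lo := mid + 1]
                    ifFalse: [hi := mid - 1]]].
        found].
    lookup value: 'beta'    "2; answers nil when the key is absent"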
In reply to this post by Reinout Heeck-2
Cross posted here, with a couple edits.
-------- Original Message --------
Subject: Re: [Pharo-project] string hash collisions?
Date: Mon, 2 Jan 2012 00:00:35 -0500
From: Andres Valloud <[hidden email]>
Reply-To: [hidden email], [hidden email]
To: [hidden email]

Part of why those attacks are so easy is that the factor for the multiplicative hash function [they attacked] is very poor (i.e.: 33, 31, etc). There is nothing spectacular about this reasoning; those "cool looking" factors chosen using the digital (finger) method ["finger method" is the original K&R verbiage] should have been stamped out a long time ago. The hashing book I wrote has a ton of additional details on how to choose a factor properly if you want to use multiplicative hash functions, see here:

http://www.lulu.com/spotlight/avsmalltalkbooks

Of course you can avoid all that trouble by using a crypto hash, at the cost of the (mostly unnecessary) crypto hashing.

The authors suggest using a randomized hash function, but how do persistent hash tables survive across invocations of the randomizer? Moreover, I suspect the following:

1. The multiplicative hash function they show as "improved" in Perl 5.8.1 seems to use effectively 1025 as a factor, which is generally terrible (slide 77).
2. How is the seed determined? What criteria are used to guarantee it is a good seed value?
3. It is likely you can still abuse the hash function with series of null bytes, or by constructing strings that abuse that poor hash function into producing tons of collisions mod 2^x, x < 32.

To sum it up...

1. If you use multiplicative hash functions, then you must choose the factor properly [to avoid extremely easy to find collisions in practice]. Go read the literature.
2. If you are expecting attackers, consider making it difficult to find collisions in your hash function. A way to make it REALLY hard is to use a crypto hash. But the applicability depends on the application. Generally speaking, making something like a crypto hash function the default hash function is a bad idea.
3. Just "randomizing" a hash function [as suggested by the slides] can be as useful as "randomizing" a random number generator. Obfuscation is not equivalent to a fix. Without a clear understanding of why the results have to be good, chances are the results will be poor.

On 12/30/11 7:58 , Philippe Marschall wrote:
> Hi
>
> As you probably noted string hash collisions are all the rage these days [1].
> Has anybody looked into whether this applies to Pharo as well?
>
> [1] http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
>
> Cheers
> Philippe
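The "series of null bytes" remark in point 3 is easy to reproduce with a toy version of such a function. Using 1025 as the factor (again a toy, not Perl's or VW's actual code), every string made entirely of NUL characters hashes to 0, so strings of different lengths all collide:

    | h1025 nuls |
    h1025 := [:s | s inject: 0 into: [:h :c | (h * 1025 + c asInteger) \\ 16r100000000]].
    nuls := (1 to: 5) collect: [:n | String new: n withAll: (Character value: 0)].
    nuls collect: h1025    "#(0 0 0 0 0) -- five distinct strings, one hash value"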
In reply to this post by Reinout Heeck-2
> Question 1: is the VW hashing algorithm for strings structured such that
> it is cheap to generate millions of strings with the same hash value?

I don't think it's THAT cheap, but with a 28 bit hash value and today's computers, a motivated attacker won't have much trouble brute forcing whatever number of collisions needed.

> Question 2: can we generically counter such an attack? The paper
> proposes random hash seeding which seems inappropriate for image-based
> development. Should we consider using a new hash function that is
> (cryptographically) one-way?

Yes, applications that expect attackers can e.g. subclass Dictionary and replace the send of #hash with the crypto hash that suits their needs. Using something like a crypto hash as a default hash for the whole system is going to be extremely painful for performance, because most uses don't need a crypto hash and its associated expense. Generally speaking, crypto hashes tend to be much more costly than non-crypto hashes.

Andres.
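One way to act on the "replace the send of #hash" suggestion without touching the collection internals is to wrap the untrusted keys instead. A rough sketch; HardenedKey and #strongDigestOf: are hypothetical names, not existing VW classes or selectors, and the digest would come from whatever crypto library is available:

    "Assume a class HardenedKey with instance variables 'string' and 'digest'."

    HardenedKey class >> on: aString
        ^self new setString: aString

    HardenedKey >> setString: aString
        string := aString.
        digest := self strongDigestOf: aString
            "plug in SHA-256 or similar here, reduced to an Integer"

    HardenedKey >> key
        ^string

    HardenedKey >> hash
        ^digest

    HardenedKey >> = other
        ^other class == self class and: [string = other key]

Code filling a dictionary from request data would then do something like dict at: (HardenedKey on: fieldName) put: fieldValue, so the cost of the stronger hash stays confined to that one dictionary.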
In reply to this post by Eliot Miranda-2
On 1/4/2012 8:31 PM, Eliot Miranda wrote:
Hi Eliot!

The POST variables are merely one collection open to abuse. Consider that:

I could attack by giving your web server many HTTP headers with names that generate hash clashes.
I could send a SOAP payload that is crafted such that a dictionary in the XML parser is attacked.
If you use VW to read email I could send you a specially crafted email so the MIME parser will churn over hash clashes.
I could publish some stuff into the open Store repository such that all Store clients will churn over hash clashes when browsing the repository.
I could open an SSL connection to a VW server and present it a certificate that is crafted to make the server churn.
etc, etc, etc...

I'm not sure whether all of these scenarios are actually possible, however finding that out would require research effort. Hence my inclination to look for a generic fix; it seems impossible/error-prone to fix all the separate use cases.

On a global level we might be able to detect when a hashed collection has a certain percentage of identical hashes (>50% ?) and then raise an exception. Could we do this cheaply though? Alternatively we could switch to cryptographically more secure hashes, but that would be excessively expensive....

R
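A cheap variant of that clustering check can at least be prototyped at the application boundary, run over incoming names before they are stuffed into a Dictionary; whether something like it can be done generically inside the collection classes without slowing down every add: is a separate question. In this sketch fieldNames is a stand-in for whatever the request parser produced:

    | fieldNames tally worstCount |
    fieldNames := #('alpha' 'beta' 'gamma' 'delta').    "stand-in for the parsed parameter names"
    tally := Bag new.
    fieldNames do: [:each | tally add: each hash].
    worstCount := tally asSet inject: 0 into: [:m :h | m max: (tally occurrencesOf: h)].
    worstCount > (fieldNames size // 2)
        ifTrue: [self error: 'suspicious hash clustering in request parameters']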
In reply to this post by Andres Valloud-6
On 1/5/2012 1:00 AM, Andres Valloud wrote:
>> Question 1: is the VW hashing algorithm for strings structured such that
>> it is cheap to generate millions of strings with the same hash value?
> I don't think it's THAT cheap, but with a 28 bit hash value and today's
> computers, a motivated attacker won't have much trouble brute forcing
> whatever number of collisions needed.

Doesn't this imply that a cryptographic hash algorithm will be insecure anyway unless we allow hash values bigger than 28 bits?

>> Question 2: can we generically counter such an attack? The paper
>> proposes random hash seeding which seems inappropriate for image-based
>> development. Should we consider using a new hash function that is
>> (cryptographically) one-way?
> Yes, applications that expect attackers can e.g. subclass Dictionary and
> replace the send of #hash with the crypto hash that suits their needs.
> Using something like a crypto hash as a default hash for the whole system
> is going to be extremely painful for performance, because most uses don't
> need a crypto hash and its associated expense.

Hmm, that seems to imply I'll have to give up the abstraction that the Smalltalk library gave me. I will need to know whether a piece of code that my server uses makes use of 'unsafe' hashing, I will need to know deep details of the libraries I use -> no abstraction. Alternatively all libraries will need to be coded using a 'safe' variant of hashed collection so I can trust my server is safe against this attack.

> Generally speaking, crypto hashes tend to be much more costly than
> non-crypto hashes.
>
> Andres.
In reply to this post by Reinout Heeck-2
"Reinout Heeck"<[hidden email]> wrote:
> >> Question 1: is the VW hashing algorithm for strings structured such that
> >> it is cheap to generate millions of strings with the same hash value?
> > I don't think it's THAT cheap, but with a 28 bit hash value and today's
> > computers, a motivated attacker won't have much trouble brute forcing
> > whatever number of collisions needed.
> Doesn't this imply that a cryptographic hash algorithm will be insecure
> anyway unless we allow hash values bigger than 28 bits?

Depends on what you mean by "cryptographic", but the definition of a "cryptographically secure" hash function requires it to be "one-way" and "collision-resistant", the latter meaning that the amount of work needed to find a collision makes it infeasible. I don't see how you can achieve that with 28 bits. Taking the birthday paradox into account, you'll need at most 16K samples (on average) to find a collision no matter how sophisticated the function is. So unless a single hash operation is enormously computationally expensive (making it completely impractical), I don't see how it can be secure.
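The 16K figure is just the birthday bound for a 28-bit hash:

    (2 raisedTo: 28) sqrt    "16384.0"
    "i.e. roughly 2 raisedTo: 14 random strings before two of them share a
     28-bit hash, regardless of how good the function is."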
In reply to this post by Reinout Heeck-2
"Andres Valloud"<[hidden email]> wrote:
> The authors suggest using a randomized hash function, but how do
> persistent hash tables survive across invocations of the randomizer?

How about a "keyed hash" where a new key is generated for each hashed collection instance and persisted with it. The important requirement is that the key of any particular collection isn't predictable, so you may want a cryptographically secure random generator, but that's a cost paid once when the collection is created, you don't pay for that on every lookup. Could be feasible for a reasonable number of use cases.
In reply to this post by Reinout Heeck-2
Reinout Heeck wrote:
> Hmm, that seems to imply I'll have to give up the abstraction that the
> Smalltalk library gave me.

Unfortunately that tends to be the nature of security problems: something far down in a library you use turns out to have a problem with hashing, or a buffer overflow, or the permissions it uses, etc. A lot of algorithm design thinks in terms of a hypothetical adversary, but doesn't really build for the assumption that a concrete adversary will actually show up on the virtual doorstep one day.

In randomized algorithm design, one of the criteria is that although an algorithm may have theoretical worst-case performance, that worst case should not depend on the input. That is, the randomized component might give worst-case performance in one run, but running with the same input again would not. I'm not sure if that can be applied to this sort of problem, but it would be a general solution. The other general solution would be to use data structures with better worst-case scenarios.
In reply to this post by Reinout Heeck-2
On 1/5/2012 8:30 AM, Reinout Heeck wrote:
> On 1/5/2012 1:00 AM, Andres Valloud wrote:
>>> Question 1: is the VW hashing algorithm for strings structured such that
>>> it is cheap to generate millions of strings with the same hash value?
>> I don't think it's THAT cheap, but with a 28 bit hash value and today's
>> computers, a motivated attacker won't have much trouble brute forcing
>> whatever number of collisions needed.
> Doesn't this imply that a cryptographic hash algorithm will be insecure
> anyway unless we allow hash values bigger than 28 bits?

No, what I meant is that the current default non-crypto hash is 28 bits, so even if it is obfuscated you can still brute force it easily. With a crypto hash you'd have large integer hashes, but I wouldn't necessarily worry too much about it unless you're continuously serving several thousand requests per second (and then of course you'd have other serious concerns, such as when the GC is allowed to run etc).

> Alternatively all libraries will need to be coded using a 'safe' variant
> of hashed collection so I can trust my server is safe against this attack.

But that also runs the risk of adding a lot of complexity just in case someone needs it, and still not being enough for a specific application we're not yet aware of.

Andres.
In reply to this post by Reinout Heeck-2
As a default hash, yes I'd say crypto hashes would be too expensive for little to no gain. For this specific instance, however... I'm sure you can get a crypto hash of a reasonably sized string in a millisecond. How many requests per second are you handling? And how many milliseconds of network latency to the client using the app?

On 1/5/2012 7:51 AM, Reinout Heeck wrote:
> Alternatively we could switch to cryptographically more secure hashes
> but that would be excessively expensive....
On Thu, Jan 5, 2012 at 2:01 PM, Andres Valloud <[hidden email]> wrote:
> As a default hash, yes I'd say crypto hashes would be too expensive for
> little to no gain. For this specific instance, however...

I wonder if we are chasing the wrong problem here. The original issue had to do with too many POST values sent as a Seaside request. Most if not all of the hashing issues could be avoided by adding an upper bound (2k - 5k) to the number of values processed by the Seaside request handler. Yes, the overall size of the request will be quite big, but once the handler passes the upper limit it's easy to throw an exception and stop the processing. That way you never have to worry about hashing millions of bogus POST values.
Yes, that too... how you approach the security issues you're facing really depends on the application.

On 1/5/2012 2:48 PM, Jon Paynter wrote:
In reply to this post by Reinout Heeck-2
>> I lack the time to parse this in relation to VW

Sorry, but the explanation I'd give for how this relates to VW will take longer to parse... :)

Hash collisions typically happen due to a combination of hash range and hash table size. The smallest of either of these typically determines the number of
collisions; however, VW is different. The example in the document shows a hash->bucket design in which the hash value is used to find a sequence that is slowly searched. The attack is to add unequal
objects of equal hash value to maximize the amount of linear search and growth in each hash bucket. An example of hash->bucket implementation in VW is the class-side behavior Symbol uses to intern strings. The trick would also be to minimize the hash variance
of the data added to reduce the chance that a rehash will increase the size of the hash table and that distribution into buckets will not be improved once the hash table size is increased.
VW hashed collections have logical hash buckets with search indexes that are determined by the size of the hash table. VW grows the ENTIRE hash table after
an addition when #fullCheck notices there are less than 25% free slots. VW uses the hash value as a starting point for doing a linear search that is limited only by the discovery of a nil slot or equal object. VW is more affected by the distribution of empty
slots than it is by hash table size. VW needs to have the free slots because without them the #find*OrNil: methods would have the cost of searching an increasing percentage of the hash table as items are added. The free slots serve as a stop that a separate
physical bucket would normally provide. This is more costly (like #fixCollisionsFrom:) than a physical hash bucket design that can manage buckets individually. VW attempts to expand the distribution range in methods like #initialIndexFor:boundedBy:, but that
would have little benefit when there are many equal hash values for unequal objects. VW has the linear search costs that the attack exploits, searches more than once for some operations, has a larger scope of growth, and searches through objects that would
otherwise belong to another bucket. The VW implementation would be more susceptible to a hash collision attack than the hash->bucket design that the document describes how to exploit.

    "This artificially forces a highly saturated hash table to simulate
     lookup cost alone that could be incurred with a hash collision attack."
    | normalSet normalResults slowerSet slowerResults notIncluded |
    normalSet := Set new: Symbol tableSize.
    notIncluded := Set new: Symbol tableSize.
    Symbol table size to: 1 by: -1 do: [:bucketIndex |
        (bucketIndex < 6 ifTrue: [notIncluded] ifFalse: [normalSet])
            addAll: (Symbol table at: bucketIndex)].
    notIncluded := notIncluded asArray.
    normalResults := Array
        with: normalSet basicSize
        with: normalSet size
        with: (normalSet basicSize / normalSet size) asFloat
        with: (Time millisecondsToRun:
            [10 timesRepeat: [notIncluded do: [:ea | normalSet includes: ea]]]).
    slowerSet := normalSet copyEmpty: normalSet size + 1.
    normalSet do: [:each | slowerSet noCheckAdd: each].
    slowerResults := Array
        with: slowerSet basicSize
        with: slowerSet size
        with: (slowerSet basicSize / slowerSet size) asFloat
        with: (Time millisecondsToRun:
            [10 timesRepeat: [notIncluded do: [:ea | slowerSet includes: ea]]]).
    Array with: normalResults with: slowerResults

    #(#(280277 128057 2.18869 1) #(128431 128057 1.00292 1936))

The performance difference was 1 ms vs. 1936 ms. A hash collision attack would cause inefficient searches through a long series of overlapping "buckets" much like this code simulates. Because VW overlaps logical
buckets, a search that happens in a logical bucket position after an attacked bucket position would experience some of the cost that would otherwise be isolated to just one physical bucket. It wouldn't matter how many free slots VW grows the collection for
because the free spaces would be clustered distant from the start of the search from colliding items. Others have mentioned that the 28 bit computed hash range from a VW string may be of sufficient range to make it extremely difficult to do a string-based collision
attack. I agree it would be difficult to simulate an actual hash collision attack using a hashed collection of strings with 28 bits of hash value range; however, I suspect the assumption is being made that the hash values from other languages have less range
than 28 bits. I couldn't say one way or another. If 28 bits is enough range to be considered attack resistant then other languages could compute a 28 bit hash value from a string just like VW does.
Only a small variance of the hash value is needed to reduce a distributed "meet in the middle" collision attack. VW gets some automatic variance from the size
of the hash table. A physical buckets design doesn't need to adjust the size of the hash table as frequently as VW collections do, but VW's more frequent growth of the hash table can reduce some of the computed collisions from #initialIndexFor:boundedBy:.
Assuming that VW string's 28-bit hash range avoids the problem... The hash values for non-strings are not externally controllable in a way that could cause
a hash collision through passing data. For some, that will be the answer they are looking for, case closed. But hold on... While VW is generally safe from direct external attack by data, the performance problems that the attack hopes to exploit are so common
in VW that it is accepted as normal. Most objects in VW default to the 14 bit (16,383) range of values (#maximumIdentityHashValue) from #identityHash. This narrow range causes many smaller/distributed hash collisions. Many now recognize that the inefficiency
of VW hashed collections becomes more pronounced once it contains more than 16K items. Few recognize that the collision costs exist for smaller collections--because the slowness is pervasive and consistent.
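A quick way to see the 14-bit limitation Paul describes (numbers per his description; the exact constant is #maximumIdentityHashValue on a given VM) is to tally identity hashes for a pile of fresh objects:

    ((1 to: 50000) collect: [:i | Object new identityHash]) asSet size
        "at most 16384 distinct values for 50000 objects, so identity-hashed
         collections above that size necessarily carry long collision chains"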
In a sense, VW comes pre-attacked with the performance impact of the "meet in the middle" approach of the document. If VW's 28-bit hash range is exploitable
then the design VW uses for hashed collections would likely crash harder and faster than a design that doesn't have the problems I've described. The main difference is that VW would crash in acquiring more memory to grow collections while others might just
get very slow but linger on. The limited range that VW returns for #identityHash is one of the biggest performance problems that 32-bit VW has because it causes poor distribution into the
logical hash buckets. There are many ways to fix the problem by changing the collections or computations. One way to avoid the problem is to use a 64-bit VM that can offer a greater #maximumIdentityHashValue. Upgrading to 64-bit is easier than reimplementing
VW's hashed collections so it is the option Cincom would encourage.

Regards,

Paul Baumann

From: [hidden email] [mailto:[hidden email]] On Behalf Of Reinout Heeck
In reply to this post by Reinout Heeck-2
On Jan 4, 2012, at 7:59 AM, Reinout Heeck wrote:
Naive question from me. Can one step back and just ditch hashing altogether by using a binary search tree collection? In this particular domain, it sounds like the keys are always strings.

I wrote a simple non-balancing tree (maybe I'll fix that later; I was reading on the scapegoat balancing algorithm and found it intriguing) and had some fun with polymorphism. Even with that naive approach, for a randomly distributed tree (meaning not worst-case linear, but not best-case balance either), I was seeing times that were not that much slower than using a Dictionary.

I did not figure out how to create a pathological hash collision. If someone wants to tell me how to compute 1000 strings with the same hash, I'll compare that. :)

--
Travis Griggs
Objologist
"The best way to know you have a mind is to change it" -Judge Pierre Leval
In reply to this post by Paul Baumann
Thanks Paul, I always learn something new from your posts!

I just spotted VW support Resolution 100401, which “implements Dictionaries using buckets to resolve hash collisions”. The resolution talks about improving the performance of finalization, which uses IdentityDictionary and is subject to the consecutive slot problem. It looks like the resolution just provides the new Dictionary classes, without actually changing existing code to use them. Hopefully still some use to others, and maybe a foundation to build on for a future solution integrated in VW – e.g. to replace IdentityDictionary.

All the best,

Steve

From: [hidden email] [mailto:[hidden email]] On Behalf Of Paul Baumann
In reply to this post by mkobetic
On 1/5/2012 6:31 PM, [hidden email] wrote:
"Andres Valloud"[hidden email] wrote:The authors suggest using a randomized hash function, but how do persistent hash tables survive across invocations of the randomizer?How about a "keyed hash" where a new key is generated for each hashed collection instance and persisted with it. ooh, that sounds like a useful idea ! Simple enough to be worth trying across the entire library too :-) R - The important requirement is that the key of any particular collection isn't predictable, so you may want a cryptographically secure random generator, but that's a cost paid once when the collection is created, you don't pay for that on every lookup. Could be feasible for reasonable number of use cases. _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
In reply to this post by Andres Valloud-6
On 1/5/2012 11:01 PM, Andres Valloud wrote:
> As a default hash, yes I'd say crypto hashes would be too expensive for
> little to no gain. For this specific instance, however... I'm sure you
> can get a crypto hash of a reasonably sized string in a millisecond. How
> many requests per second are you handling? And how many milliseconds of
> network latency to the client using the app?
>
> On 1/5/2012 7:51 AM, Reinout Heeck wrote:
>> Alternatively we could switch to cryptographically more secure hashes
>> but that would be excessively expensive....

Our app is relatively low traffic but handling high value. Given other thoughts in this thread I'm turning towards the idea that hashes should be unpredictable rather than cryptographically strong.

The bitcoin mining scene shows how my current graphics card can compute 2^32 double-SHA256 hashes in about a minute, so 28-bit VW hash clashes should be trivial to generate with it.

https://en.bitcoin.it/wiki/Mining_hardware_comparison#AMD_.28ATI.29