New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SR-3330] inefficiencies in Set / Dictionary use of hash #45918
Comments
@mattneub In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace. When you're creating a new Set without specifying a minimum capacity, the set might have only one bucket, so when you're inserting the second element, a collision occurs. The insert method will then decide if the storage should be grown, using something called a load factor. If the storage was grown, the existing elements have to be migrated over to the new storage buffer. That's when you're seeing all those extra calls to hashValue when inserting 4. The reason you're still seeing more calls to == than you would expect if the number of buckets is equal or higher to the number of elements has to do with an implementation detail of the bucket index calculation. The bits of the hashValue are mixed or "shuffled" before the modulo operation. This is to cut down on excessive collisions for types with bad hash algorithms. |
Comment by Alexis Beingessner (JIRA) What's actually happening:
So what you're seeing is:
|
Beingessner (JIRA User) Thanks so much. Valuable indeed (and eye-opening). It is very kind of you to give such full and precise information. |
The talkative struct shows how useful the initializers that reserve capacity are—all the resizing is gone if you create 's' with 'Set.init(minimumCapacity:)': var t = Set<S>(minimumCapacity: 5)
for i in 1...5 {
print("inserting", i)
t.insert(S(i))
}
inserting 1
called hashValue for 1
inserting 2
called hashValue for 2
inserting 3
called hashValue for 3
inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
inserting 5
called hashValue for 5 |
To prevent accidentally quadratic behavior when merging dictionaries/sets (as well as an extra layer of protection against HashDoS attacks), each Set/Dictionary now uses a different hash seed, and this seed also needs to change whenever the hash table is resized. This requires rehashing every key whenever we resize the table. This means this is now desirable/expected behavior; we cannot change it without incurring a large performance regression. (As well as an ABI break, since hash table layouts are part of the stdlib ABI.) |
Environment
Xcode 8.1
Additional Detail from JIRA
md5: a3d21b1bbf412f128589860284735c58
Issue Description:
This arises from a question posed on Stack Overflow: http://stackoverflow.com/questions/31665802/
In this simple example I've instrumented a rudimentary struct's `hashValue` and `==` so that we know when they are called:
We then create an empty Set and insert five unequal instances of that struct into it:
What I expect to see is that the `hashValue` for each struct instance instance should be called once, on insertion, and that's all. The `hashValue` never changes, so there would be no reason to call it more than once; the Set should be keeping track of the hash values of the already present elements. Moreover, `==` should never be called, because no two hash values are the same (there are no "collisions").
But that is not at all what we do see. Here's the output (I've added spaces):
I am not a computer scientist, but I find this mystifying. Not only is `==` called many times; it is called multiple times for the same pairs as part of the same insertion ("inserting 4" is the one I'm thinking of).
So the question is: are we non-computer-scientist types just misreading the evidence, or does this evidence in fact show that the implementation of `insert` is grossly inefficient? (And we have the same question about Dictionary, which uses the same algorithms and behaves the same way.)
The text was updated successfully, but these errors were encountered: