[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

mattneub · 2016-12-05T00:57:21Z


Previous ID	SR-3330
Radar	None
Original Reporter	@mattneub
Type	Bug
Status	Resolved
Resolution	Won't Do

Environment

Xcode 8.1

Additional Detail from JIRA


Votes	1
Component/s	Standard Library
Labels	Bug
Assignee	None
Priority	Medium

md5: a3d21b1bbf412f128589860284735c58

Issue Description:

This arises from a question posed on Stack Overflow: http://stackoverflow.com/questions/31665802/

In this simple example I've instrumented a rudimentary struct's `hashValue` and `==` so that we know when they are called:

struct S : Hashable {
    static func ==(lhs:S,rhs:S) -> Bool {
        print("called == for", lhs.id, rhs.id)
        return lhs.id == rhs.id
    }
    let id : Int
    var hashValue : Int {
        print("called hashValue for", self.id)
        return self.id
    }
    init(_ id:Int) {self.id = id}
}

We then create an empty Set and insert five unequal instances of that struct into it:

var s = Set<S>()
for i in 1...5 {
    print("inserting", i)
    s.insert(S(i))
}

What I expect to see is that the `hashValue` for each struct instance instance should be called once, on insertion, and that's all. The `hashValue` never changes, so there would be no reason to call it more than once; the Set should be keeping track of the hash values of the already present elements. Moreover, `==` should never be called, because no two hash values are the same (there are no "collisions").

But that is not at all what we do see. Here's the output (I've added spaces):

inserting 1
called hashValue for 1

inserting 2
called hashValue for 2
called == for 1 2
called hashValue for 1
called hashValue for 2

inserting 3
called hashValue for 3

inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
called hashValue for 2
called hashValue for 3
called hashValue for 1
called hashValue for 4
called == for 3 4
called == for 1 4

inserting 5
called hashValue for 5

I am not a computer scientist, but I find this mystifying. Not only is `==` called many times; it is called multiple times for the same pairs as part of the same insertion ("inserting 4" is the one I'm thinking of).

So the question is: are we non-computer-scientist types just misreading the evidence, or does this evidence in fact show that the implementation of `insert` is grossly inefficient? (And we have the same question about Dictionary, which uses the same algorithms and behaves the same way.)

robinkunde · 2016-12-05T06:11:38Z

@mattneub In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace. When you're creating a new Set without specifying a minimum capacity, the set might have only one bucket, so when you're inserting the second element, a collision occurs. The insert method will then decide if the storage should be grown, using something called a load factor. If the storage was grown, the existing elements have to be migrated over to the new storage buffer. That's when you're seeing all those extra calls to hashValue when inserting 4.

The reason you're still seeing more calls to == than you would expect if the number of buckets is equal or higher to the number of elements has to do with an implementation detail of the bucket index calculation. The bits of the hashValue are mixed or "shuffled" before the modulo operation. This is to cut down on excessive collisions for types with bad hash algorithms.

swift-ci · 2016-12-08T15:57:32Z

Comment by Alexis Beingessner (JIRA)

What's actually happening:

We hash a value only once on insertion.
We don't use hashes for comparison of elements, only ==. Using hashes for comparison is only reasonable if you store the hashes, but that means more memory usage for every Dictionary. A compromise that needs evaluation.
We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.
When we resize the Dictionary, we have to rehash everything, because we didn't store the hashes.

So what you're seeing is:

one hash of the search key
some =='s (searching for a space)
hashes of every element in the collection (resize)
one hash of the search key (actually totally wasteful, but not a big deal considering it only happens after an O👎 reallocation)
some =='s (searching for a space in the new buffer)

mattneub · 2016-12-08T18:31:07Z

Beingessner (JIRA User) Thanks so much. Valuable indeed (and eye-opening). It is very kind of you to give such full and precise information.

natecook1000 · 2016-12-08T21:49:52Z

The talkative struct shows how useful the initializers that reserve capacity are—all the resizing is gone if you create 's' with 'Set.init(minimumCapacity:)':

var t = Set<S>(minimumCapacity: 5)
for i in 1...5 {  
    print("inserting", i)  
    t.insert(S(i)) 
} 
inserting 1
called hashValue for 1
inserting 2
called hashValue for 2
inserting 3
called hashValue for 3
inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
inserting 5
called hashValue for 5

lorentey · 2022-04-07T03:49:39Z

To prevent accidentally quadratic behavior when merging dictionaries/sets (as well as an extra layer of protection against HashDoS attacks), each Set/Dictionary now uses a different hash seed, and this seed also needs to change whenever the hash table is resized. This requires rehashing every key whenever we resize the table.

This means this is now desirable/expected behavior; we cannot change it without incurring a large performance regression. (As well as an ABI break, since hash table layouts are part of the stdlib ABI.)

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

mattneub commented Dec 5, 2016

robinkunde commented Dec 5, 2016

swift-ci commented Dec 8, 2016

mattneub commented Dec 8, 2016

natecook1000 commented Dec 8, 2016

lorentey commented Apr 7, 2022

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

Comments

mattneub commented Dec 5, 2016

robinkunde commented Dec 5, 2016

swift-ci commented Dec 8, 2016

mattneub commented Dec 8, 2016

natecook1000 commented Dec 8, 2016

lorentey commented Apr 7, 2022