Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

Closed
mattneub opened this issue Dec 5, 2016 · 5 comments
Closed

[SR-3330] inefficiencies in Set / Dictionary use of hash #45918

mattneub opened this issue Dec 5, 2016 · 5 comments
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. standard library Area: Standard library umbrella

Comments

@mattneub
Copy link

mattneub commented Dec 5, 2016

Previous ID SR-3330
Radar None
Original Reporter @mattneub
Type Bug
Status Resolved
Resolution Won't Do
Environment

Xcode 8.1

Additional Detail from JIRA
Votes 1
Component/s Standard Library
Labels Bug
Assignee None
Priority Medium

md5: a3d21b1bbf412f128589860284735c58

Issue Description:

This arises from a question posed on Stack Overflow: http://stackoverflow.com/questions/31665802/

In this simple example I've instrumented a rudimentary struct's `hashValue` and `==` so that we know when they are called:

struct S : Hashable {
    static func ==(lhs:S,rhs:S) -> Bool {
        print("called == for", lhs.id, rhs.id)
        return lhs.id == rhs.id
    }
    let id : Int
    var hashValue : Int {
        print("called hashValue for", self.id)
        return self.id
    }
    init(_ id:Int) {self.id = id}
}

We then create an empty Set and insert five unequal instances of that struct into it:

var s = Set<S>()
for i in 1...5 {
    print("inserting", i)
    s.insert(S(i))
}

What I expect to see is that the `hashValue` for each struct instance instance should be called once, on insertion, and that's all. The `hashValue` never changes, so there would be no reason to call it more than once; the Set should be keeping track of the hash values of the already present elements. Moreover, `==` should never be called, because no two hash values are the same (there are no "collisions").

But that is not at all what we do see. Here's the output (I've added spaces):

inserting 1
called hashValue for 1

inserting 2
called hashValue for 2
called == for 1 2
called hashValue for 1
called hashValue for 2

inserting 3
called hashValue for 3

inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
called hashValue for 2
called hashValue for 3
called hashValue for 1
called hashValue for 4
called == for 3 4
called == for 1 4

inserting 5
called hashValue for 5

I am not a computer scientist, but I find this mystifying. Not only is `==` called many times; it is called multiple times for the same pairs as part of the same insertion ("inserting 4" is the one I'm thinking of).

So the question is: are we non-computer-scientist types just misreading the evidence, or does this evidence in fact show that the implementation of `insert` is grossly inefficient? (And we have the same question about Dictionary, which uses the same algorithms and behaves the same way.)

@robinkunde
Copy link
Contributor

@mattneub In hashed collections, collisions can occur whenever the number of buckets is smaller than the keyspace. When you're creating a new Set without specifying a minimum capacity, the set might have only one bucket, so when you're inserting the second element, a collision occurs. The insert method will then decide if the storage should be grown, using something called a load factor. If the storage was grown, the existing elements have to be migrated over to the new storage buffer. That's when you're seeing all those extra calls to hashValue when inserting 4.

The reason you're still seeing more calls to == than you would expect if the number of buckets is equal or higher to the number of elements has to do with an implementation detail of the bucket index calculation. The bits of the hashValue are mixed or "shuffled" before the modulo operation. This is to cut down on excessive collisions for types with bad hash algorithms.

@swift-ci
Copy link
Collaborator

swift-ci commented Dec 8, 2016

Comment by Alexis Beingessner (JIRA)

What's actually happening:

  • We hash a value only once on insertion.

  • We don't use hashes for comparison of elements, only ==. Using hashes for comparison is only reasonable if you store the hashes, but that means more memory usage for every Dictionary. A compromise that needs evaluation.

  • We try to insert the element before evaluating if the Dictionary can fit that element. This is because the element might already be in the Dictionary, in which case we don't need any more capacity.

  • When we resize the Dictionary, we have to rehash everything, because we didn't store the hashes.

So what you're seeing is:

  • one hash of the search key

  • some =='s (searching for a space)

  • hashes of every element in the collection (resize)

  • one hash of the search key (actually totally wasteful, but not a big deal considering it only happens after an O👎 reallocation)

  • some =='s (searching for a space in the new buffer)

@mattneub
Copy link
Author

mattneub commented Dec 8, 2016

Beingessner (JIRA User) Thanks so much. Valuable indeed (and eye-opening). It is very kind of you to give such full and precise information.

@natecook1000
Copy link
Member

The talkative struct shows how useful the initializers that reserve capacity are—all the resizing is gone if you create 's' with 'Set.init(minimumCapacity:)':

var t = Set<S>(minimumCapacity: 5)
for i in 1...5 {  
    print("inserting", i)  
    t.insert(S(i)) 
} 
inserting 1
called hashValue for 1
inserting 2
called hashValue for 2
inserting 3
called hashValue for 3
inserting 4
called hashValue for 4
called == for 3 4
called == for 1 4
inserting 5
called hashValue for 5

@lorentey
Copy link
Member

lorentey commented Apr 7, 2022

To prevent accidentally quadratic behavior when merging dictionaries/sets (as well as an extra layer of protection against HashDoS attacks), each Set/Dictionary now uses a different hash seed, and this seed also needs to change whenever the hash table is resized. This requires rehashing every key whenever we resize the table.

This means this is now desirable/expected behavior; we cannot change it without incurring a large performance regression. (As well as an ABI break, since hash table layouts are part of the stdlib ABI.)

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. standard library Area: Standard library umbrella
Projects
None yet
Development

No branches or pull requests

5 participants