[SR-3268] Hash-Dependent Iteration Order Produces Quadratic Behaviour in Dictionary/Set #45856

swift-ci · 2016-11-24T05:34:15Z


Previous ID	SR-3268
Radar	None
Original Reporter	Gankro (JIRA User)
Type	Bug
Status	Resolved
Resolution	Done

Attachment: Download

Environment

Should be true in at least Swift 2 and Swift 3, and all platforms.

Additional Detail from JIRA


Votes	2
Component/s	Standard Library
Labels	Bug, Performance
Assignee	@lorentey
Priority	Medium

md5: ee40175f0f6337a3ee378a50f0a30cdf

Issue Description:

Rust recently ran into a nasty bug which led to quadratic behaviour in appending the contents of one HashMap to another. Swift has the same problem for Dictionary/Set. The problem is explained excellently here: http://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion

In essence, the problem requires:

iteration order to be tied to where elements hash in the collection (linear scan of the array)
a linear-probing-based collision resolution algorithm (FCFS for Swift, Robin Hood for Rust)
hashes mapped to indices via modulo (masking off bits)
max load factor greater than 50% (75% for Swift, 90% for Rust) (? not 100% sure if <50% is adequate)
inserting the elements of one map into another with less capacity (as would naturally occur when inserting all the elements of a map into a freshly created one without reserving space)

With all of these combined, the larger map has become a soft hash collision computer for the smaller map. Elements near the start of the first and second half of the larger map are guaranteed to be mapped near the start of the smaller map. Once iteration of the larger map hits the midpoint of its array, this leads to tons of collisions at the start of the smaller map's array, snowballing into linear search times for empty buckets.

Rust's solution was to use a different seed for its hash function for every hashmap (in fact, older versions of Rust already did this, the change to a global seed was an attempted optimization). I'm not sure if this is even possible while we're using Foundation's hashcode system.

Alternatively you could come up with some iteration scheme which is decoupled from the hash. Naively shifting the linear scan is inadequate. This is likely to hurt iteration performance under normal operation, as it uses cache worse.

Alternatively we could move to a fancier collision resolution scheme quadratic probing?). This is likely to hurt performance on normal operation as it trashes the cache.

Alternatively we could lower the load factor❓. This will hurt iteration speed and increase memory usage in common usage.

We could also just decide that this code is "bad" and it performing poorly is acceptable. I'm not a fan of this solution, but y'know, it's there.

Here's a sample program exhibiting the problematic behaviour.

let size = Int(CommandLine.arguments[1])!

var dict1 = [Int:Int]()

for i in 0..<size {
    dict1[i] = i * 2
}

var dict2 = [Int:Int]()

for (k, v) in dict1 {
    dict2[k] = v
}

print("computed value: \(dict2[size/2]!)")

Running this on different input sizes shows clearly quadratic behaviour:

swiftc -O dict.swift

time ./dict 1000000
computed value: 1000000

real    0m0.250s
user    0m0.197s
sys 0m0.045s

--------
time ./dict 2000000
computed value: 2000000

real    0m0.498s
user    0m0.400s
sys 0m0.094s

-----
time ./dict 3000000
computed value: 3000000

real    0m9.617s
user    0m9.498s
sys 0m0.102s

-----------
time ./dict 4000000
computed value: 4000000

real    0m1.089s
user    0m0.880s
sys 0m0.195s

---------
time ./dict 5000000
computed value: 5000000

real    4m45.307s
user    4m43.069s
sys 0m1.143s

The dip in run time at 4,000,000 is a result of the behaviour being very load factor sensitive.

The text was updated successfully, but these errors were encountered:

jtbandes · 2016-11-24T07:09:30Z

This is pretty interesting. It would be great to get some benchmarks added here: https://github.com/apple/swift/tree/master/benchmark

swift-ci · 2016-11-24T23:16:18Z

Comment by Alexis Beingessner (JIRA)

Adding a benchmark now would probably be inappropriate as it would DDOS the benchmarking infra. One should definitely be added with a fix.

jtbandes · 2016-11-24T23:38:45Z

It could be a small benchmark for now 🙂

natecook1000 · 2016-12-08T21:57:02Z

This is such a neat bug. I read the Rust article but hadn’t looked at whether it affects Swift collections as well—good find!

One way to fix this without storing an additional per-collection seed is to XOR the storage capacity with an element's hash value before squeezing. If the source and destination collections have the same capacity the quadratic behavior doesn't show up, and if they don't have the same capacity that will guarantee that they get a different hash order for the same elements, eliminating the blow ups. There's an additional cost for bucket lookups this way but you avoid the quadratic behavior.

milseman · 2018-03-12T21:52:09Z

You don't need to store a per-collection seed, because each collection already has a source of identity: the memory address of its storage allocation. You can use relevant bits from there, e.g.

 addressBits[48:4]

to seed hashing.

lorentey · 2018-03-12T22:14:54Z

We're certainly susceptible to this effect; I attached some benchmarks (from #15017 (comment) demonstrating quadratic behavior for `dict.filter { _ in true }`. It's a wonderful gem of a benchmark curve, beautifully alternating between linear and quadratic segments. It's almost a shame we need to get rid of it! 🙂

Set and Dictionary currently use a global random seed; we should perturb that seed with something to eliminate this issue. Simply XORing either the storage address or the capacity to the seed would eliminate the O(n^2) behavior.

Per-instance seeding with the storage address is generally a good idea, except in cases where we want repeatable results – in test suites and the like. In those environments, we disable random hash seeding, but these operations should still not regress to quadratic performance. So when a hash seed override is in effect, we should still use the storage capacity as the hash seed.

milseman · 2018-03-12T22:26:28Z

Is the storage capacity the requested capacity, or the real spare capacity of the allocation? If the latter, then that's still a source of instability.

edit: This is assuming that malloc_size could give different answers for the same sized malloc.

lorentey · 2018-03-12T22:35:45Z

That's true, but Set/Dictionary ordering already depends on the allocated capacity, so making the dependency even tighter would not cause any regressions.

Set/Dictionary is pretty good about having repeatable allocation patterns, so capacity-dependent ordering doesn't typically lead to nondeterminism in tests that are otherwise deterministic.

edit: We don't currently make use of any extra space that malloc gives us. Our hash tables are restricted to power-of-two sizes, which makes it difficult to change this.

milseman · 2018-03-13T00:12:54Z

Why are power-of-two sizes required? I know that such restrictions are common in quadratically probed hash tables (e.g. LLVM's), so that the traversal is more evenly distributed modulo modulo, but am unfamiliar with why a linearly probed hash table would require this.

swift-ci · 2018-03-13T00:27:17Z

Comment by Alexis Beingessner (JIRA)

> Why are power-of-two sizes required?

It's so we can store the cap as `mask = realCap - 1` and implement all accesses as `idx & mask`.

milseman · 2018-03-13T00:29:37Z

That sounds like a good reason.

lorentey · 2018-03-19T16:46:56Z

Fixed in #15239; the "Quadratic.attaresult" chart (attached) plots the per-element costs of Dictionary.filter { _ in true } before and after the fix.

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SR-3268] Hash-Dependent Iteration Order Produces Quadratic Behaviour in Dictionary/Set #45856

[SR-3268] Hash-Dependent Iteration Order Produces Quadratic Behaviour in Dictionary/Set #45856

swift-ci commented Nov 24, 2016

jtbandes commented Nov 24, 2016

swift-ci commented Nov 24, 2016

jtbandes commented Nov 24, 2016

natecook1000 commented Dec 8, 2016

milseman mannequin commented Mar 12, 2018

lorentey commented Mar 12, 2018

milseman mannequin commented Mar 12, 2018

lorentey commented Mar 12, 2018

milseman mannequin commented Mar 13, 2018

swift-ci commented Mar 13, 2018

milseman mannequin commented Mar 13, 2018

lorentey commented Mar 19, 2018

[SR-3268] Hash-Dependent Iteration Order Produces Quadratic Behaviour in Dictionary/Set #45856

[SR-3268] Hash-Dependent Iteration Order Produces Quadratic Behaviour in Dictionary/Set #45856

Comments

swift-ci commented Nov 24, 2016

jtbandes commented Nov 24, 2016

swift-ci commented Nov 24, 2016

jtbandes commented Nov 24, 2016

natecook1000 commented Dec 8, 2016

milseman mannequin commented Mar 12, 2018

lorentey commented Mar 12, 2018

milseman mannequin commented Mar 12, 2018

lorentey commented Mar 12, 2018

milseman mannequin commented Mar 13, 2018

swift-ci commented Mar 13, 2018

milseman mannequin commented Mar 13, 2018

lorentey commented Mar 19, 2018