[SR-1824] Swift 3 regression makes this test > 200 times slower with Swift 3 than with Swift 2.2 #44433

jepers · 2016-06-19T07:06:41Z


Previous ID	SR-1824
Radar	rdar://problem/26900080
Original Reporter	@jepers
Type	Bug
Status	Closed
Resolution	Done

Environment

OS X 10.11.5, Xcode 8.0 beta (8S128d)

Additional Detail from JIRA


Votes	0
Component/s	Compiler
Labels	Bug, 3.0Regression, Performance
Assignee	@aschwaighofer
Priority	Medium

md5: 892fa44eee9598f9cb460581a93a350c

Issue Description:

//------------------------------------------------------------------------------
// Demonstration of a Swift 3 regression
//------------------------------------------------------------------------------
// Please see my issue https://bugs.swift.org/browse/SR-203 which was fixed by
// edf9ca0 Unroll loops with known short trip count.
//
// Turns out that in Swift 3, the situation is worse than when I reported the
// above issue, SR-203, back on 12 Dec 2015.
//
// This program is the same as that in SR-203, only adapted for Swift 3.
//
// It runs more than 200 times slower with Swift 3 than with Swift 2.2. : (
//
// In the comments below I have attached the results for
// Swift 3, Xcode 8.0 beta (8S128d)
// Swift 2.2, Xcode 7.3.1 (7D1014)
// and I've also kept the results from when I reported SR-203.
//
// NOTE:
// When this example works as expected (as it does/did in Xcode 7.3.1 (7D1014)),
// my generic struct V4 should be as fast as when using a simd type, i.e.
// V4<Float> should be as fast as float4,
// and
// V4<V4<Float>> should be as fast as float4x4.
//
//------------------------------------------------------------------------------


import QuartzCore // (for CACurrentMediaTime)
import simd

let manuallyUnrolled = true // <-- Note how/if/why results depend on this flag.

//------------------------------------------------------------------------------
// Requirements and defaults for testable types (see the test at the end).
//------------------------------------------------------------------------------
protocol DefaultInitializable { init() }
protocol Testable : DefaultInitializable {
    static func random() -> Self
    func +(lhs: Self, rhs: Self) -> Self
}
extension Testable {
    static func random(count: Int) -> [Self] { return (0 ..< count).map { _ in Self.random() } }
}
extension Float    : Testable { static func random() -> Float {
    return unsafeBitCast(UInt32(127 << 23) | (arc4random() & 0x7fffff), to: Float.self) - 1.0 } // [0, 1)
}
extension float4   : Testable { static func random() -> float4   { return float4(Float.random(count: 4)) } }
extension float4x4 : Testable { static func random() -> float4x4 { return float4x4(float4.random(count: 4)) } }

//------------------------------------------------------------------------------
// A generic testable type with a "static storage" of four testable elements.
// Note that V4<V4<Float>> makes a (Testable) 4x4 float "matrix" type.
//------------------------------------------------------------------------------
struct V4<T: Testable> : Testable {
    var elements: (T, T, T, T)
    init(_ e0: T, _ e1: T, _ e2: T, _ e3: T) { elements = (e0, e1, e2, e3) }
    init() { self.init(T(), T(), T(), T()) }
    static func random() -> V4 { return .init(T.random(), T.random(), T.random(), T.random()) }
    var count: Int { return 4 }
    var indices: Range<Int> { return 0 ..< count }
    subscript(index: Int) -> T {
        get { precondition(0 <= index && index < count, "Index out of bounds")
            var selfCopy = self; return withUnsafePointer(&selfCopy) { UnsafePointer<T>($0)[index] } }
        set { precondition(0 <= index && index < count, "Index out of bounds")
            withUnsafeMutablePointer(&self) { UnsafeMutablePointer<T>($0)[index] = newValue } }
        // (Implementing subscript with eg switch case instead of unsafe ptr produces the same results.)
    }
    func mapWith(_ other: V4, transform: (T, T) -> T) -> V4 {
        var r = V4()
        if manuallyUnrolled {
            r[0] = transform(self[0], other[0])
            r[1] = transform(self[1], other[1])
            r[2] = transform(self[2], other[2])
            r[3] = transform(self[3], other[3])
        } else {
            for i in 0 ..< 4 { r[i] = transform(self[i], other[i]) }
        }
        return r
    }
}
func +<T: Testable>(lhs: V4<T>, rhs: V4<T>) -> V4<T> { return lhs.mapWith(rhs, transform: +) }

//------------------------------------------------------------------------------
// The test (sum a million random values of the given testable type)
//------------------------------------------------------------------------------
func test<T: Testable>(_: T.Type) {
    let a: [T] = T.random(count: 1_000_000)
    var times = [Double]()
    var sum = T()
    for _ in 0 ..< 7 {
        let t0 = CACurrentMediaTime()
        for i in a.indices { sum = sum + a[i] }
        let t1 = CACurrentMediaTime()
        times.append(t1 - t0)
    }
    print(String(format: "  %@ median time: %9.6f s   (DCE-prevention: %lld)",
                 { $0 + String(repeating: Character(" "), count: 13 - $0.characters.count) }("\(T.self)"),
                 times.sorted()[times.count / 2],
                 "\(sum)".hashValue & 0xffff))
}
print("Running test compiled with manuallyUnrolled = \(manuallyUnrolled):")
test(float4)
test(V4<Float>)
test(float4x4)
test(V4<V4<Float>>)


//==============================================================================
// Current example results with Swift 3, Xcode 8.0 beta (8S128d)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000947 s   (DCE-prevention: 64318)
//      V4<Float>     median time:  0.092634 s   (DCE-prevention: 35151)
//      float4x4      median time:  0.003706 s   (DCE-prevention: 49514)
//      V4<V4<Float>> median time:  0.798284 s   (DCE-prevention: 39690)
//
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.001134 s   (DCE-prevention: 5130)
//      V4<Float>     median time:  0.092355 s   (DCE-prevention: 30665)
//      float4x4      median time:  0.003196 s   (DCE-prevention: 52707)
//      V4<V4<Float>> median time:  0.786128 s   (DCE-prevention: 53556)
//------------------------------------------------------------------------------

//==============================================================================
// Current example results with Swift 2.2, Xcode 7.3.1 (7D1014)  (EXPECTED RES.)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000951 s   (DCE-prevention: 17675)
//      V4<Float>     median time:  0.000946 s   (DCE-prevention: 49217)
//      float4x4      median time:  0.003488 s   (DCE-prevention: 623)
//      V4<V4<Float>> median time:  0.003433 s   (DCE-prevention: 5820)
//
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000952 s   (DCE-prevention: 60271)
//      V4<Float>     median time:  0.000977 s   (DCE-prevention: 10724)
//      float4x4      median time:  0.003468 s   (DCE-prevention: 26118)
//      V4<V4<Float>> median time:  0.003399 s   (DCE-prevention: 8786)
//------------------------------------------------------------------------------

//==============================================================================
// Old example results, from when I reported SR-203
//==============================================================================
// Using swiftc built from (then) latest open source master (built using
// swift/utils/build-script -R --no-assertions --no-swift-stdlib-assertions):
//
// 1. Compiling and running with manuallyUnrolled set to true:
//    $ xcrun ~/apple/build/Ninja-Release/swift-macosx-x86_64/bin/swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000961 s   (DCE-prevention: 7687)
//      V4<Float>     median time:  0.000948 s   (DCE-prevention: 28622)
//      float4x4      median time:  0.003454 s   (DCE-prevention: 42956)
//      V4<V4<Float>> median time:  0.003257 s   (DCE-prevention: 43011)
//
// 2. Compiling and running with manuallyUnrolled set to false:
//    $ xcrun ~/apple/build/Ninja-Release/swift-macosx-x86_64/bin/swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000963 s   (DCE-prevention: 51683)
//      V4<Float>     median time:  0.001099 s   (DCE-prevention: 52176)
//      float4x4      median time:  0.003531 s   (DCE-prevention: 41733)
//      V4<V4<Float>> median time:  0.015062 s   (DCE-prevention: 37967)
//
// Note that in 1, code using my custom types is optimized to be as fast as
// the SIMD float4 and float4x4, while in 2, especially V4<V4<Float>> is about
// four times slower.
//
//------------------------------------------------------------------------------
// Using swiftc from Xcode 7.2 (7C68). NOTE: This is irrelevant for the issue at
// hand, but it's nice to see how much better this test performs when using the
// current open source master. Also, it might be interesting to note that the
// manuallyUnrolled flag doesn't make a difference with the swiftc of Xcode 7.2:
//
// 3. Compiling and running with manuallyUnrolled set to true:
//    $ swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000990 s   (DCE-prevention: 31706)
//      V4<Float>     median time:  0.088104 s   (DCE-prevention: 58860)
//      float4x4      median time:  0.003442 s   (DCE-prevention: 8133)
//      V4<V4<Float>> median time: 10.511913 s   (DCE-prevention: 42261)
//
// 4. Compiling and running with manuallyUnrolled set to false:
//    $ swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000946 s   (DCE-prevention: 10443)
//      V4<Float>     median time:  0.091342 s   (DCE-prevention: 3630)
//      float4x4      median time:  0.003782 s   (DCE-prevention: 26092)
//      V4<V4<Float>> median time: 10.895031 s   (DCE-prevention: 38005)
//------------------------------------------------------------------------------


jepers · 2016-06-19T07:07:52Z

aschwaighofer@apple.com (JIRA User)

aschwaighofer · 2016-06-29T18:35:50Z

The inliner is less aggressive these days and so we end up not inlining a function (mapWith) which had previously enabled removing a closure.

This can be fixed by annotating mapWith with @inline(__always).
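A minimal sketch of where that annotation goes, assuming a simplified non-generic stand-in for the issue's V4 (`Vec4` and its initializer are illustrative names, not the original code, and the sketch uses current Swift syntax):

```swift
// Simplified, non-generic stand-in for the issue's V4 (hypothetical name).
struct Vec4 {
    var elements: (Float, Float, Float, Float)

    init(_ e0: Float, _ e1: Float, _ e2: Float, _ e3: Float) {
        elements = (e0, e1, e2, e3)
    }

    // Ask the compiler to always inline this method, so the closure passed
    // as `transform` becomes visible to the optimizer at each call site.
    @inline(__always)
    func mapWith(_ other: Vec4, transform: (Float, Float) -> Float) -> Vec4 {
        return Vec4(transform(elements.0, other.elements.0),
                    transform(elements.1, other.elements.1),
                    transform(elements.2, other.elements.2),
                    transform(elements.3, other.elements.3))
    }
}

// Same shape as the issue's operator: the addition is passed as a closure.
func + (lhs: Vec4, rhs: Vec4) -> Vec4 {
    return lhs.mapWith(rhs, transform: +)
}

let sum = Vec4(1, 2, 3, 4) + Vec4(10, 20, 30, 40)
print(sum.elements) // (11.0, 22.0, 33.0, 44.0)
```

In the original program the attribute would sit directly above `func mapWith` inside `struct V4`; nothing else needs to change.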

I will hopefully commit a change to the optimizer that will make this annotation unnecessary.

The closure specializer should have inlined the closure into mapWith. If #3257 goes in it will do so. This will again enable removal of the closure.

In my experiments this recovers performance.
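To illustrate what specializing a closure into mapWith amounts to, here is a hand-written sketch. The names `Quad`, `mapWith` (as a free function), and `mapWithAdding` are hypothetical, and the second function is only a manual approximation of what the optimizer would generate, not its actual output:

```swift
// Four floats, standing in for the storage of V4<Float> (hypothetical alias).
typealias Quad = (Float, Float, Float, Float)

// Closure-taking form: each element operation is an indirect call unless
// the optimizer can see which closure was passed.
func mapWith(_ a: Quad, _ b: Quad, transform: (Float, Float) -> Float) -> Quad {
    return (transform(a.0, b.0), transform(a.1, b.1),
            transform(a.2, b.2), transform(a.3, b.3))
}

// Hand-specialized form for `+`: the closure is gone and the four additions
// are plain arithmetic that later passes are free to optimize further.
func mapWithAdding(_ a: Quad, _ b: Quad) -> Quad {
    return (a.0 + b.0, a.1 + b.1, a.2 + b.2, a.3 + b.3)
}

let viaClosure = mapWith((1, 2, 3, 4), (10, 20, 30, 40), transform: +)
let specialized = mapWithAdding((1, 2, 3, 4), (10, 20, 30, 40))
print(viaClosure == specialized) // true
```

Both forms compute the same result; the difference is only in how much the compiler has to prove before it can remove the closure call.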

aschwaighofer · 2016-06-29T22:23:57Z

Should be fixed by 8f3d26d.

jepers · 2016-07-08T15:21:49Z

(Just ran the above code in Xcode 8 beta 2, got the same results as with the first Xcode 8 beta, so I guess 8f3d26d wasn't included in beta 2.)

aschwaighofer · 2016-07-08T19:30:24Z

You are right. The fix did not make it into beta 2.

aschwaighofer · 2016-07-19T19:59:02Z

Hi Jens,
can you try with beta 3? It should be fixed now. Worked for me.

Thank you

jepers · 2016-07-19T20:58:43Z

Thanks, works for me too now:

//==============================================================================
// Example results with Swift 3, Xcode 8.0 beta 3 (8S174q)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.001327 s   (DCE-prevention: 57874)
//      V4<Float>     median time:  0.000946 s   (DCE-prevention: 17169)
//      float4x4      median time:  0.003460 s   (DCE-prevention: 15613)
//      V4<V4<Float>> median time:  0.003290 s   (DCE-prevention: 29086)
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.001010 s   (DCE-prevention: 47972)
//      V4<Float>     median time:  0.000949 s   (DCE-prevention: 37396)
//      float4x4      median time:  0.003381 s   (DCE-prevention: 15376)
//      V4<V4<Float>> median time:  0.003195 s   (DCE-prevention: 26322)
//------------------------------------------------------------------------------

aschwaighofer · 2016-07-19T21:04:31Z

Great! Closing the bug.

jepers · 2016-09-16T22:56:17Z

aschwaighofer@apple.com (JIRA User)
A related question:
This shows that the optimizer can recognize 4 and 16 successive float additions and produce code that is as fast as if we had used simd float4 and float4x4 instead of eg V4<Float> and V4<V4<Float>>.

Is this only true for (float) addition or does the compiler also support other simd (types and) operations, like dot product, matrix multiplication, etc?

That is, if I implemented dot product for V4<Float>, would the optimizer be able to make it as fast as the simd dot product (for float4)?
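For concreteness, the kind of hand-written dot product being asked about might look like the sketch below. `Vec4` and `dot` are hypothetical names chosen for illustration, and the sketch says nothing about whether the optimizer turns this into the same code as the simd version:

```swift
// Hypothetical four-component float vector, standing in for V4<Float>.
struct Vec4 {
    var x, y, z, w: Float
}

// Plain scalar dot product: four multiplications and three additions.
func dot(_ a: Vec4, _ b: Vec4) -> Float {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w
}

let d = dot(Vec4(x: 1, y: 2, z: 3, w: 4), Vec4(x: 5, y: 6, z: 7, w: 8))
print(d) // 70.0
```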

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

AnthonyLatsis added swift 3.0 regression and removed 3.0 regression labels Nov 19, 2022

This issue was closed.
