[SR-1824] Swift 3 regression makes this test > 200 times slower with Swift 3 than with Swift 2.2 #44433

Closed
jepers opened this issue Jun 19, 2016 · 9 comments
Labels: bug, compiler, performance, regression, swift 3.0

@jepers
jepers commented Jun 19, 2016

Previous ID SR-1824
Radar rdar://problem/26900080
Original Reporter @jepers
Type Bug
Status Closed
Resolution Done
Environment

OS X 10.11.5, Xcode 8.0 beta (8S128d)

Additional Detail from JIRA
Votes 0
Component/s Compiler
Labels Bug, 3.0Regression, Performance
Assignee @aschwaighofer
Priority Medium

md5: 892fa44eee9598f9cb460581a93a350c

Issue Description:

//------------------------------------------------------------------------------
// Demonstration of a Swift 3 regression
//------------------------------------------------------------------------------
// Please see my issue https://bugs.swift.org/browse/SR-203 which was fixed by
// edf9ca0 Unroll loops with known short trip count.
//
// Turns out that in Swift 3, the situation is worse than when I reported the
// above issue, SR-203, back on 12 Dec 2015.
//
// This program is the same as that in SR-203, only adapted for Swift 3.
//
// It runs more than 200 times slower with Swift 3 than with Swift 2.2. : (
//
// In the comments below I have attached the results for
// Swift 3, Xcode 8.0 beta (8S128d)
// Swift 2.2, Xcode 7.3.1 (7D1014)
// and I've also kept the results from when I reported SR-203.
//
// NOTE:
// When this example works as expected (as it does/did in Xcode 7.3.1 (7D1014)),
// my generic struct V4 should be as fast as the corresponding simd type, i.e.
// V4<Float> should be as fast as float4,
// and
// V4<V4<Float>> should be as fast as float4x4.
//
//------------------------------------------------------------------------------


import QuartzCore // (for CACurrentMediaTime)
import simd

let manuallyUnrolled = true // <-- Note how/if/why results depend on this flag.

//------------------------------------------------------------------------------
// Requirements and defaults for testable types (see the test at the end).
//------------------------------------------------------------------------------
protocol DefaultInitializable { init() }
protocol Testable : DefaultInitializable {
    static func random() -> Self
    func +(lhs: Self, rhs: Self) -> Self
}
extension Testable {
    static func random(count: Int) -> [Self] { return (0 ..< count).map { _ in Self.random() } }
}
extension Float    : Testable { static func random() -> Float {
    return unsafeBitCast(UInt32(127 << 23) | (arc4random() & 0x7fffff), to: Float.self) - 1.0 } // [0, 1)
}
extension float4   : Testable { static func random() -> float4   { return float4(Float.random(count: 4)) } }
extension float4x4 : Testable { static func random() -> float4x4 { return float4x4(float4.random(count: 4)) } }

//------------------------------------------------------------------------------
// A generic testable type with a "static storage" of four testable elements.
// Note that V4<V4<Float>> makes a (Testable) 4x4 float "matrix" type.
//------------------------------------------------------------------------------
struct V4<T: Testable> : Testable {
    var elements: (T, T, T, T)
    init(_ e0: T, _ e1: T, _ e2: T, _ e3: T) { elements = (e0, e1, e2, e3) }
    init() { self.init(T(), T(), T(), T()) }
    static func random() -> V4 { return .init(T.random(), T.random(), T.random(), T.random()) }
    var count: Int { return 4 }
    var indices: Range<Int> { return 0 ..< count }
    subscript(index: Int) -> T {
        get { precondition(0 <= index && index < count, "Index out of bounds")
            var selfCopy = self; return withUnsafePointer(&selfCopy) { UnsafePointer<T>($0)[index] } }
        set { precondition(0 <= index && index < count, "Index out of bounds")
            withUnsafeMutablePointer(&self) { UnsafeMutablePointer<T>($0)[index] = newValue } }
        // (Implementing subscript with e.g. a switch statement instead of unsafe pointers produces the same results.)
    }
    func mapWith(_ other: V4, transform: (T, T) -> T) -> V4 {
        var r = V4()
        if manuallyUnrolled {
            r[0] = transform(self[0], other[0])
            r[1] = transform(self[1], other[1])
            r[2] = transform(self[2], other[2])
            r[3] = transform(self[3], other[3])
        } else {
            for i in 0 ..< 4 { r[i] = transform(self[i], other[i]) }
        }
        return r
    }
}
func +<T: Testable>(lhs: V4<T>, rhs: V4<T>) -> V4<T> { return lhs.mapWith(rhs, transform: +) }

//------------------------------------------------------------------------------
// The test (sum a million random values of the given testable type)
//------------------------------------------------------------------------------
func test<T: Testable>(_: T.Type) {
    let a: [T] = T.random(count: 1_000_000)
    var times = [Double]()
    var sum = T()
    for _ in 0 ..< 7 {
        let t0 = CACurrentMediaTime()
        for i in a.indices { sum = sum + a[i] }
        let t1 = CACurrentMediaTime()
        times.append(t1 - t0)
    }
    print(String(format: "  %@ median time: %9.6f s   (DCE-prevention: %lld)",
                 { $0 + String(repeating: Character(" "), count: 13 - $0.characters.count) }("\(T.self)"),
                 times.sorted()[times.count / 2],
                 "\(sum)".hashValue & 0xffff))
}
print("Running test compiled with manuallyUnrolled = \(manuallyUnrolled):")
test(float4)
test(V4<Float>)
test(float4x4)
test(V4<V4<Float>>)


//==============================================================================
// Current example results with Swift 3, Xcode 8.0 beta (8S128d)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000947 s   (DCE-prevention: 64318)
//      V4<Float>     median time:  0.092634 s   (DCE-prevention: 35151)
//      float4x4      median time:  0.003706 s   (DCE-prevention: 49514)
//      V4<V4<Float>> median time:  0.798284 s   (DCE-prevention: 39690)
//
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.001134 s   (DCE-prevention: 5130)
//      V4<Float>     median time:  0.092355 s   (DCE-prevention: 30665)
//      float4x4      median time:  0.003196 s   (DCE-prevention: 52707)
//      V4<V4<Float>> median time:  0.786128 s   (DCE-prevention: 53556)
//------------------------------------------------------------------------------

//==============================================================================
// Current example results with Swift 2.2, Xcode 7.3.1 (7D1014)  (EXPECTED RES.)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000951 s   (DCE-prevention: 17675)
//      V4<Float>     median time:  0.000946 s   (DCE-prevention: 49217)
//      float4x4      median time:  0.003488 s   (DCE-prevention: 623)
//      V4<V4<Float>> median time:  0.003433 s   (DCE-prevention: 5820)
//
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000952 s   (DCE-prevention: 60271)
//      V4<Float>     median time:  0.000977 s   (DCE-prevention: 10724)
//      float4x4      median time:  0.003468 s   (DCE-prevention: 26118)
//      V4<V4<Float>> median time:  0.003399 s   (DCE-prevention: 8786)
//------------------------------------------------------------------------------

//==============================================================================
// Old example results, from when I reported SR-203
//==============================================================================
// Using swiftc built from (then) latest open source master (built using
// swift/utils/build-script -R --no-assertions --no-swift-stdlib-assertions):
//
// 1. Compiling and running with manuallyUnrolled set to true:
//    $ xcrun ~/apple/build/Ninja-Release/swift-macosx-x86_64/bin/swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000961 s   (DCE-prevention: 7687)
//      V4<Float>     median time:  0.000948 s   (DCE-prevention: 28622)
//      float4x4      median time:  0.003454 s   (DCE-prevention: 42956)
//      V4<V4<Float>> median time:  0.003257 s   (DCE-prevention: 43011)
//
// 2. Compiling and running with manuallyUnrolled set to false:
//    $ xcrun ~/apple/build/Ninja-Release/swift-macosx-x86_64/bin/swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000963 s   (DCE-prevention: 51683)
//      V4<Float>     median time:  0.001099 s   (DCE-prevention: 52176)
//      float4x4      median time:  0.003531 s   (DCE-prevention: 41733)
//      V4<V4<Float>> median time:  0.015062 s   (DCE-prevention: 37967)
//
// Note that in 1, code using my custom types is optimized to be as fast as
// the SIMD float4 and float4x4, while in 2, V4<V4<Float>> in particular is
// about four times slower.
//
//------------------------------------------------------------------------------
// Using swiftc from Xcode 7.2 (7C68). NOTE: This is irrelevant for the issue at
// hand, but it's nice to see how much better this test performs when using the
// current open source master. Also, it might be interesting to note that the
// manuallyUnrolled flag doesn't make a difference with the swiftc of Xcode 7.2:
//
// 3. Compiling and running with manuallyUnrolled set to true:
//    $ swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.000990 s   (DCE-prevention: 31706)
//      V4<Float>     median time:  0.088104 s   (DCE-prevention: 58860)
//      float4x4      median time:  0.003442 s   (DCE-prevention: 8133)
//      V4<V4<Float>> median time: 10.511913 s   (DCE-prevention: 42261)
//
// 4. Compiling and running with manuallyUnrolled set to false:
//    $ swiftc -O unroll-test.swift && ./unroll-test
//    Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.000946 s   (DCE-prevention: 10443)
//      V4<Float>     median time:  0.091342 s   (DCE-prevention: 3630)
//      float4x4      median time:  0.003782 s   (DCE-prevention: 26092)
//      V4<V4<Float>> median time: 10.895031 s   (DCE-prevention: 38005)
//------------------------------------------------------------------------------
@jepers
Author

jepers commented Jun 19, 2016

aschwaighofer@apple.com (JIRA User)

@aschwaighofer
Member

The inliner is less aggressive these days, so we no longer inline mapWith. Inlining that function is what previously allowed the closure to be removed.

This can be fixed by annotating mapWith with @inline(__always).

I will hopefully commit a change to the optimizer that will make this annotation unnecessary.

The closure specializer should have inlined the closure into mapWith. If #3257 goes in, it will do so, which will again enable removal of the closure.

In my experiments this recovers performance.
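For readers who want to try the annotation workaround described above, here is a minimal sketch of where `@inline(__always)` would go. It is not the code from the report: `FloatV4` is a hypothetical, Float-only stand-in for the generic `V4`, it uses a switch-based subscript, and it is written in current Swift syntax rather than the Swift 3 beta syntax of the original program. It shows only the placement of the attribute and makes no claim about timings.

```swift
// Hypothetical, simplified stand-in for the report's generic V4 type.
struct FloatV4 {
    var elements: (Float, Float, Float, Float) = (0, 0, 0, 0)

    subscript(index: Int) -> Float {
        get {
            switch index {
            case 0: return elements.0
            case 1: return elements.1
            case 2: return elements.2
            case 3: return elements.3
            default: preconditionFailure("Index out of bounds")
            }
        }
        set {
            switch index {
            case 0: elements.0 = newValue
            case 1: elements.1 = newValue
            case 2: elements.2 = newValue
            case 3: elements.3 = newValue
            default: preconditionFailure("Index out of bounds")
            }
        }
    }

    // The annotation asks the compiler to inline mapWith at every call site,
    // so the `transform` closure is visible to the optimizer there.
    @inline(__always)
    func mapWith(_ other: FloatV4, transform: (Float, Float) -> Float) -> FloatV4 {
        var r = FloatV4()
        for i in 0 ..< 4 { r[i] = transform(self[i], other[i]) }
        return r
    }

    static func + (lhs: FloatV4, rhs: FloatV4) -> FloatV4 {
        return lhs.mapWith(rhs, transform: +)
    }
}

let a = FloatV4(elements: (1, 2, 3, 4))
let b = FloatV4(elements: (10, 20, 30, 40))
let sum = a + b
print(sum.elements)
```

In the original program the same attribute would go directly in front of `func mapWith` inside `V4`.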

@aschwaighofer
Member

Should be fixed by 8f3d26d.

@jepers
Author

jepers commented Jul 8, 2016

(Just ran the above code in Xcode 8 beta 2, got the same results as with the first Xcode 8 beta, so I guess 8f3d26d wasn't included in beta 2.)

@aschwaighofer
Member

You are right. The fix did not make it into beta 2.

@aschwaighofer
Member

Hi Jens,
Can you try with beta 3? It should be fixed now. It worked for me.

Thank you

@jepers
Author

jepers commented Jul 19, 2016

Thanks, works for me too now:

//==============================================================================
// Example results with Swift 3, Xcode 8.0 beta 3 (8S174q)
//==============================================================================
// 1. Running test compiled with manuallyUnrolled = true:
//      float4        median time:  0.001327 s   (DCE-prevention: 57874)
//      V4<Float>     median time:  0.000946 s   (DCE-prevention: 17169)
//      float4x4      median time:  0.003460 s   (DCE-prevention: 15613)
//      V4<V4<Float>> median time:  0.003290 s   (DCE-prevention: 29086)
// 2. Running test compiled with manuallyUnrolled = false:
//      float4        median time:  0.001010 s   (DCE-prevention: 47972)
//      V4<Float>     median time:  0.000949 s   (DCE-prevention: 37396)
//      float4x4      median time:  0.003381 s   (DCE-prevention: 15376)
//      V4<V4<Float>> median time:  0.003195 s   (DCE-prevention: 26322)
//------------------------------------------------------------------------------

@aschwaighofer
Member

Great! Closing the bug.

@jepers
Author

jepers commented Sep 16, 2016

aschwaighofer@apple.com (JIRA User)
A related question:
This shows that the optimizer can recognize 4 and 16 successive float additions and produce code that is as fast as if we had used simd float4 and float4x4 instead of e.g. V4<Float> and V4<V4<Float>>.

Is this only true for (float) addition or does the compiler also support other simd (types and) operations, like dot product, matrix multiplication, etc?

That is, if I implemented dot product for V4<Float>, would the optimizer be able to make it as fast as the simd dot product (for float4)?
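The thread does not answer this question. To make it concrete, a dot product over a four-float tuple might look like the sketch below. `V4f` and `dot` are hypothetical names, not part of the original program, and the sketch shows only what the operation would compute. Whether the optimizer turns it into code as fast as the simd version would have to be measured, for example with the timing harness from the report.

```swift
// Hypothetical Float-only four-element vector with tuple storage,
// mirroring the "static storage" idea of V4 in the report.
struct V4f {
    var elements: (Float, Float, Float, Float)

    init(_ e0: Float, _ e1: Float, _ e2: Float, _ e3: Float) {
        elements = (e0, e1, e2, e3)
    }
}

// Dot product written as four multiplications and three additions.
func dot(_ a: V4f, _ b: V4f) -> Float {
    let (a0, a1, a2, a3) = a.elements
    let (b0, b1, b2, b3) = b.elements
    return a0 * b0 + a1 * b1 + a2 * b2 + a3 * b3
}

// 1*5 + 2*6 + 3*7 + 4*8 = 70
let d = dot(V4f(1, 2, 3, 4), V4f(5, 6, 7, 8))
print(d)
```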

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.