[SR-7150] Recent dev snapshots make this code 4 times slower #49698

jepers · 2018-03-09T05:13:09Z


Previous ID	SR-7150
Radar	None
Original Reporter	@jepers
Type	Bug

Attachment: Download

Additional Detail from JIRA


Votes	0
Component/s	Compiler
Labels	Bug
Assignee	None
Priority	Medium

md5: 5caab9c0d3e83a66d11a891bfae1db8b

relates to:

SR-7023 Missed optimization opportunity, but not missed if wrapping code in a nested func

Issue Description:

This is the test.swift program in the attachment:

//
// As discussed here:
// https://forums.swift.org/t/how-to-analyze-the-performance-of-for-example-this-little-program/10695
//
// This program demonstrates a regression in recent dev snapshots resulting
// in 4 times longer computation time for the last two of the four "range
// conversions" as seen below (on my MBP late 2013, macOS 10.13.3 (17D102)).
// 
//
// --------------------------------------------------------------------------
// Before Regression (using eg default toolchain of Xcode 9.2 and 9.3 beta 3)
// --------------------------------------------------------------------------
// › sudo xcode-select -s /Applications/Xcode-beta.app/
// › xcrun swiftc --version
// Apple Swift version 4.1 (swiftlang-902.0.41 clang-902.0.31)
// Target: x86_64-apple-darwin17.4.0
// 
// › xcrun swiftc -O -whole-module-optimization -gnone -static-stdlib test.swift
// › ./test
// 
// Range converting 100000000 values from UInt8 to UInt64:
//   trial 0: 0.132198705046903 seconds (checksum: 2170205185142300190)
//   trial 1: 0.109685704985168 seconds (checksum: 3761688987579986996)
//   trial 2: 0.110061422979925 seconds (checksum: 5642533481369980494)
//   trial 3: 0.114856946980581 seconds (checksum: 7740398493674204011)
//   trial 4: 0.111156655009836 seconds (checksum: 12659530246663417775)
// 
// Range converting 100000000 values from Int8 to UInt64:
//   trial 0: 0.109313847031444 seconds (checksum: 9476562641788044214)
//   trial 1: 0.11203276499873 seconds (checksum: 14323354221939181237)
//   trial 2: 0.117191826982889 seconds (checksum: 9548902814626120895)
//   trial 3: 0.109507674991619 seconds (checksum: 723401728380766742)
//   trial 4: 0.109779970021918 seconds (checksum: 2748926567846913555)
// 
// Range converting 100000000 values from UInt8 to Int64:
//   trial 0: 0.108220668975264 seconds (checksum: 5136152271503443783)
//   trial 1: 0.110781208029948 seconds (checksum: 3689348814741910323)
//   trial 2: 0.11080539698014 seconds (checksum: 4123389851770370361)
//   trial 3: 0.110172690998297 seconds (checksum: -4846791580151137092)
//   trial 4: 0.110611339041498 seconds (checksum: -6148914691236517206)
// 
// Range converting 100000000 values from Int8 to Int64:
//   trial 0: 0.110338026017416 seconds (checksum: 3544668469065757012)
//   trial 1: 0.122289317019749 seconds (checksum: -651061555542690115)
//   trial 2: 0.118017453001812 seconds (checksum: 1663823975275763472)
//   trial 3: 0.116950674972031 seconds (checksum: -3978709506094216977)
//   trial 4: 0.119433180021588 seconds (checksum: 6655295901103053869)
//
// NOTE: All four range conversions take about 0.1 seconds.
//
//
// --------------------------------------------------------------------------
// After Regression (I have tested dev snapshots 2018-02-25 ... 2018-03-08)
// --------------------------------------------------------------------------
// › /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-08-a.xctoolchain/usr/bin/swiftc \
// --version
// Apple Swift version 4.2-dev (LLVM c4ec2ab808, Clang af436f313e, Swift 74ba135008)
// Target: x86_64-apple-darwin17.4.0
//
// › /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-08-a.xctoolchain/usr/bin/swiftc \
// -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk \
// -O -whole-module-optimization -gnone -static-stdlib test.swift
// › ./test
// 
// Range converting 100000000 values from UInt8 to UInt64:
//   trial 0: 0.144382501021028 seconds (checksum: 8897841259083430779)
//   trial 1: 0.120100836968049 seconds (checksum: 7306357456645743973)
//   trial 2: 0.113219286024105 seconds (checksum: 8825501086245354106)
//   trial 3: 0.115648402017541 seconds (checksum: 5642533481369980494)
//   trial 4: 0.113908740982879 seconds (checksum: 6655295901103053916)
// 
// Range converting 100000000 values from Int8 to UInt64:
//   trial 0: 0.156202154990751 seconds (checksum: 6510615555426900600)
//   trial 1: 0.120782539015636 seconds (checksum: 7378697629483820613)
//   trial 2: 0.12854404799873 seconds (checksum: 9042521604759584122)
//   trial 3: 0.119271910982206 seconds (checksum: 15336116641672254691)
//   trial 4: 0.121582918975037 seconds (checksum: 13238251629368031171)
// 
// Range converting 100000000 values from UInt8 to Int64:
//   trial 0: 0.397824143001344 seconds (checksum: 6655295901103053916)
//   trial 1: 0.402513558976352 seconds (checksum: -723401728380766731)
//   trial 2: 0.400139409000985 seconds (checksum: -8897841259083430780)
//   trial 3: 0.402008056000341 seconds (checksum: 5136152271503443783)
//   trial 4: 0.403440554044209 seconds (checksum: -8463800222054970742)
// 
// Range converting 100000000 values from Int8 to Int64:
//   trial 0: 0.415403343038633 seconds (checksum: -1374463283923456881)
//   trial 1: 0.420072704029735 seconds (checksum: 8680820740569200707)
//   trial 2: 0.40942423598608 seconds (checksum: 2170205185142300169)
//   trial 3: 0.420196880993899 seconds (checksum: 1880844493789993556)
//   trial 4: 0.422151803970337 seconds (checksum: 4340410370284600346)
//
// NOTE: The first two are still about 0.1 s, but the last two are 0.4 s.
//       That is, for some reason the last two are almost 4 times slower.
//
// --------------------------------------------------------------------------


import AppKit


extension FixedWidthInteger {
    
    /// Creates a new instance by mapping `source` from the full range of
    /// `Src` to the full range of `Self`.
    /// So for example:
    ///
    ///     Int8(rangeConverted: UInt8(  0)) == Int8(-128)
    ///     Int8(rangeConverted: UInt8(128)) == Int8(   0)
    ///     Int8(rangeConverted: UInt8(255)) == Int8( 127)
    ///
    ///     UInt32(rangeConverted: UInt8(0x00)) == UInt32(0x00_00_00_00)
    ///     UInt32(rangeConverted: UInt8(0x80)) == UInt32(0x80_80_80_80)
    ///     UInt32(rangeConverted: UInt8(0xff)) == UInt32(0xff_ff_ff_ff)
    ///
    ///     UInt16(rangeConverted: Int64.min) == UInt16.min
    ///     UInt16(rangeConverted: Int64(0) ) == UInt16(
    ///                                              (Double(UInt16.max) / 2)
    ///                                              .rounded())
    ///     UInt16(rangeConverted: Int64.max) == UInt16.max
    ///
    /// - Parameter source: An integer to map from the full range of its type
    ///                     to the full range of this type.
    init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
        let srcBW = Src.bitWidth
        let dstBW = Self.bitWidth
        let source = Src.isSigned
            ? source &+ (Src(1) << (srcBW &- 1))
            : source
        if srcBW < dstBW {
            self = Self(truncatingIfNeeded: source)
            var shifts = srcBW
            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts << 1
            }
        } else if srcBW > dstBW {
            let shifts = srcBW &- dstBW
            self = Self(truncatingIfNeeded: source >> shifts)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
        if Self.isSigned {
            self = self &+ (Self(1) << (dstBW &- 1))
        }
    }
}

// This type is only a simple example context for this particular benchmark.
final class Buffer<Element: FixedWidthInteger> {
    var data: UnsafeMutableBufferPointer<Element>
    init(count: Int) {
        precondition(count > 0)
        data = UnsafeMutableBufferPointer<Element>.allocate(capacity: count)
        data.initialize(repeating: 0)
    }
    func setAllElementsToRandomBitPatterns() {
        let rc = SecRandomCopyBytes(nil, data.count, data.baseAddress!)
        precondition(rc == errSecSuccess)
    }
    func setElements<E>(from other: Buffer<E>, transform: (E) -> Element) {
        precondition(other.data.count <= data.count)
        for i in data.indices {
            data[i] = transform(other.data[i])
        }
    }
    func checksum() -> Element {
        var checksum = Element(0)
        for e in data { checksum = checksum ^ e }
        return checksum
    }
    deinit {
        data.deallocate()
    }
}

func benchmark<S, D>(source _: S.Type, destination _: D.Type)
    where S: FixedWidthInteger, D: FixedWidthInteger
{
    let num = 100_000_000
    print("\nRange converting \(num) values from \(S.self) to \(D.self):")
    let srcBuf = Buffer<S>(count: num)
    let dstBuf = Buffer<D>(count: num)
    for trial in 0 ..< 5 {
        srcBuf.setAllElementsToRandomBitPatterns()
        let t0 = CACurrentMediaTime()
        dstBuf.setElements(from: srcBuf, transform: D.init(rangeConverted:))
        let t1 = CACurrentMediaTime()
        print("  trial \(trial):",
            t1 - t0, "seconds (checksum: \(dstBuf.checksum()))")
    }
}

// This function is just here for completeness, in case anyone should find
// it interesting to see the benchmarks for all (8 * 8 = 64) conversions.
func benchmarkAll() {
    func benchAllSources<D: FixedWidthInteger>(withDestination: D.Type) {
        benchmark(source: UInt8.self, destination: D.self)
        benchmark(source: UInt16.self, destination: D.self)
        benchmark(source: UInt32.self, destination: D.self)
        benchmark(source: UInt64.self, destination: D.self)
        benchmark(source: Int8.self, destination: D.self)
        benchmark(source: Int16.self, destination: D.self)
        benchmark(source: Int32.self, destination: D.self)
        benchmark(source: Int64.self, destination: D.self)
    }
    benchAllSources(withDestination: UInt8.self)
    benchAllSources(withDestination: UInt16.self)
    benchAllSources(withDestination: UInt32.self)
    benchAllSources(withDestination: UInt64.self)
    benchAllSources(withDestination: Int8.self)
    benchAllSources(withDestination: Int16.self)
    benchAllSources(withDestination: Int32.self)
    benchAllSources(withDestination: Int64.self)
}

// But for the purpose of this demonstration, it suffices to look at
// this subset of conversions:
func benchmark8To64() {
    benchmark(source: UInt8.self, destination: UInt64.self)
    benchmark(source: Int8.self, destination: UInt64.self)
    benchmark(source: UInt8.self, destination: Int64.self)
    benchmark(source: Int8.self, destination: Int64.self)
}

benchmark8To64()

The text was updated successfully, but these errors were encountered:

atrick · 2018-03-09T09:04:54Z

Here's what I did:

Confirmed the regression from my installed Swift 4.1 to a recent build of master.

Ran each compiled binary in Instruments to compare. There's nothing
interesting in the stack trace. There are only 4 functions of
interest. Instruments isn't showing me any samples or mangled names.

Looked at the binaries in Hopper. Identified one of the slow functions by looking at main's disassembly:

_$S4test9benchmark6source11destinationyxm_q_mts17FixedWidthIntegerRzsAER_r0_lFs5UInt8V_s5Int64VTg5Tf4dd_n

Run swift-demangle

---> function signature specialization <Arg[0] = Dead, Arg[1] = Dead> of generic specialization <Swift.UInt8, Swift.Int64> of test.benchmark<A, B where A: Swift.FixedWidthInteger, B: Swift.FixedWidthInteger>(source: A.Type, destination: B.Type) -> ()

Yep, that looks like the slow one. Search for that symbol in Hopper.

The fast code has a single block vectorized loop, 4 wide. (coincidence?)

The slow code has a bunch of compare-and-branches.

Create a reduced benchmark, because I don't want to look at a ton of
compiler output.

I created a fully specialized version using UnsafeBufferPointer. That
performed badly regardless of the compiler. That tells me it's not
really a regression but more of a nasty artifact of some inlining or
CFG simplification.

I went back to a somewhat generic version that still performs well:

@inline(never)
func foo(srcBuf: Buffer<UInt8>, dstBuf: Buffer<Int64>) {
  dstBuf.setElements(from: srcBuf, transform: Int64.init(rangeConverted:))
}

func benchmark8To64() {
  let num = 100_000_000
  print("\nRange converting \(num) values from \(UInt8.self) to \(Int64.self):")
  let srcBuf = Buffer<UInt8>(count: num)
  let dstBuf = Buffer<Int64>(count: num)
  srcBuf.setAllElementsToRandomBitPatterns()
  for _ in 0 ..< 10 {
    foo(srcBuf: srcBuf, dstBuf: dstBuf)
  }
}

benchmark8To64()

I grabbed the -frontend command line with swift -O test.swift -### and used that to see the LLVM IR:

swift -frontend -c -primary-file test8to64.swift -target x86_64-apple-darwin17.5.0 -enable-objc-interop -sdk /Applications/Xcode/External/9Q117m/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk -O -module-name test8 -emit-ir

Confirmed that the fast code looks like this (vectorized):

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %36 = getelementptr inbounds %Ts5UInt8V, %Ts5UInt8V* %32, i64 %index, i32 0
  %37 = bitcast i8* %36 to <2 x i8>*
  %wide.load = load <2 x i8>, <2 x i8>* %37, align 1, !alias.scope !35

Now, disable LLVM optimizations to see what the swift compiler is giving it:

swift -frontend -c -primary-file test8to64.swift -target x86_64-apple-darwin17.5.0 -enable-objc-interop -sdk /Applications/Xcode/External/9Q117m/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk -O -module-name test8 -emit-ir -disable-llvm-optzns

Copy-paste function "foo", which has only the code that I care about,
into a separate file (it has a different mangled name in the two
compilers). Look at a pretty view of that function:

$ /s/s/swift/utils/viewcfg tot < totpre.foo.ll

In the fast code, I can see branch that skips over one of the blocks:

; <label>:75:                                     ; preds = %70
  br label %52

; <label>:49:                                     ; preds = %47, %65, %68, %72
  %50 = phi i64 [ %74, %72 ], [ %69, %68 ], [ %67, %65 ], [ %48, %47 ]
  %51 = or i64 %40, %50
  br label %52

; <label>:52:                                     ; preds = %75, %49
  %53 = phi i64 [ %51, %49 ], [ %40, %75 ]
  %54 = shl i64 %39, 1
  %55 = icmp slt i64 %54, 64
  br i1 %55, label %56, label %57

; <label>:57:                                     ; preds = %52
  %58 = icmp eq i64 %10, 0
  br i1 %58, label %89, label %59

In the slow code

; <label>:73:                                     ; preds = %68
  br label %49

; <label>:49:                                     ; preds = %47, %63, %66, %73, %70
  %50 = phi i64 [ %72, %70 ], [ 0, %73 ], [ %67, %66 ], [ %65, %63 ], [ %48, %47 ]
  %51 = or i64 %40, %50
  %52 = shl i64 %39, 1
  %53 = icmp slt i64 %52, 64
  br i1 %53, label %54, label %55

; <label>:55:                                     ; preds = %49
  %56 = icmp eq i64 %10, 0
  br i1 %56, label %87, label %57

That's really the only difference.

It looks like the bitwidths are fully compiled away. But I see a
ridiculous amount of control flow and bounds checking within this tiny
loop:

            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts << 1
            }

Rerun the -emit-ir command with -emit-sil -emit-verbose-sil instead to see where that junk comes from.

It's all coming from

specialized protocol witness for static BinaryInteger.<< infix<A>(_:_:)

As a workaround, maybe you want to use a masking shift

                self |= self &<< shifts

It would be great to understand why the performance is fragile. It
seems like some order of expression evaluation of non-masking shifts
was slightly adjusted.

I can't follow the stdlib code, and it would take a lot more debugging
to understand how everything was specialized and inlined. I would have
to guess that it's coming from this code:

  public static func _nonMaskingLeftShift(_ lhs: Self, _ rhs: Int) -> Self {

The closest commit I can find in the past year is this, but it's just a wild guess:

commit a5ff35cd41a8199cd94ab43a4f093a889ba8fdad
Author: Maxim Moiseev <moiseev@apple.com>
Date:   Fri Jul 21 14:46:54 2017 -0700

    [stdlib] extendingOrTruncating: => truncatingIfNeeded:

A separate question is why LLVM was able to vectorize one form of the
loop and not the other. This would also require a fair amount of
debugging on the LLVM side.

jepers · 2018-03-09T10:49:54Z

Thanks, very informative and appreciated!

Interesting that using &<< and &>> works around the issue.

Btw, even though it probably doesn't add any valuable information:

I just installed Xcode 9.3 beta 4 and its default toolchain is not reproducing the issue (just like the default toolchains of 9.2 and all previous 9.3 betas).

Not sure what version of the compiler went into 9.3 beta 4 but it was released mars 5, and yet I guess it must have some version before dev snapshot 2018-02-25 (which does reproduce the issue).

swiftc --version of Xcode 9.3 beta 4's default toolchain:

› xcodebuild -version
Xcode 9.3
Build version 9Q127n
› xcrun swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.43 clang-902.0.37.1)
Target: x86_64-apple-darwin17.4.0

swiftc --version of dev snapshot 2018-02-25 (earliest snapshot I have access to):

› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-02-25-a.xctoolchain/usr/bin/swiftc --version
Apple Swift version 4.1-dev (LLVM c4ec2ab808, Clang af436f313e, Swift 251093e1f4)
Target: x86_64-apple-darwin17.4.0

Is there perhaps some way of converting between the two different Swift (build?) version formats (if that's what they are) that are given by Xcode default toolchains and dev snapshots? That is:

swiftlang-902.0.43

and

Swift 251093e1f4

jepers · 2018-03-09T14:55:17Z

I noticed some other things:

I hadn't actually tested this on Xcode 9.2 as I've erroneously written, I thought I had. But the program won't even compile with Xcode 9.2 since the syntax for UnsafeMutableBufferPointer et al has changed since then. When I adjusted the code for this, it turns out that the issue can be reproduced with Xcode 9.2. So it's only the default toolchains of Xcode 9.3 betas that doesn't reproduce the issue.

It turns out that while using only masking shifts instead of only smart shifts works around the issue, if I instead use a certain combinations of both masking- and smart shifts, It is possible to reproduce the issue on the otherwise unaffected default toolchain of Xcode 9.3 beta 4 (and probably the earlier betas) too:

    init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
        let srcBW = Src.bitWidth
        let dstBW = Self.bitWidth
        let source = Src.isSigned
            ? source &+ (Src(1) << (srcBW &- 1))
            : source
        if srcBW < dstBW {
            self = Self(truncatingIfNeeded: source)
            var shifts = srcBW
            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts &<< 1 // <-- Use a masking shift only here.
            }
        } else if srcBW > dstBW {
            let shifts = srcBW &- dstBW
            self = Self(truncatingIfNeeded: source >> shifts)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
        if Self.isSigned {
            self = self &+ (Self(1) << (dstBW &- 1))
        }
    }

atrick · 2018-03-09T17:40:13Z

That's all useful info. It also confirms that this isn't exactly a regression so much as general performance instability.

atrick · 2018-03-09T17:45:11Z

There's no direct way to translate an Xcode swift root to a github revision. I can tell you that

swiftlang-902.0.43 roughly has everything that went in by Feb 21.

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

AnthonyLatsis added the performance label Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SR-7150] Recent dev snapshots make this code 4 times slower #49698

[SR-7150] Recent dev snapshots make this code 4 times slower #49698

jepers commented Mar 9, 2018

atrick commented Mar 9, 2018

jepers commented Mar 9, 2018

jepers commented Mar 9, 2018

atrick commented Mar 9, 2018

atrick commented Mar 9, 2018

[SR-7150] Recent dev snapshots make this code 4 times slower #49698

[SR-7150] Recent dev snapshots make this code 4 times slower #49698

Comments

jepers commented Mar 9, 2018

atrick commented Mar 9, 2018

jepers commented Mar 9, 2018

jepers commented Mar 9, 2018

atrick commented Mar 9, 2018

atrick commented Mar 9, 2018