Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-7023] Missed optimization opportunity, but not missed if wrapping code in a nested func #49571

Closed
jepers opened this issue Feb 18, 2018 · 17 comments
Labels
compiler The Swift compiler in itself improvement performance

Comments

@jepers
Copy link

jepers commented Feb 18, 2018

Previous ID SR-7023
Radar None
Original Reporter @jepers
Type Improvement
Status Closed
Resolution Done

Attachment: Download

Additional Detail from JIRA
Votes 1
Component/s Compiler
Labels Improvement, Performance
Assignee None
Priority Medium

md5: d1719415de47fefc0d5b1f41eec99e68

relates to:

  • SR-6254 Redundant load elimination + alias analysis opportunity.
  • SR-6983 Unexpected drop in performance of (identity-)closure the second time it is used
  • SR-7150 Recent dev snapshots make this code 4 times slower

Issue Description:

As discussed in more detail here, the following code demonstrates that the execution speed of some code increases when simply wrapping it in a nested function, suggesting a missed optimization in the natural (not-wrapped-in-nested-func) "Normal Test" below.

import QuartzCore
import Security


extension FixedWidthInteger where Self: UnsignedInteger {
    
    /// Creates a value by mapping `source` from the full range of its type
    /// `S` to the full range of `Self`.
    init<S>(rangeConverted source: S)
        where S: FixedWidthInteger, S: UnsignedInteger
    {
        let srcByteCount = MemoryLayout<S>.size
        let dstByteCount = MemoryLayout<Self>.size
        if srcByteCount < dstByteCount {
            let dstBitCount = dstByteCount &* 8
            self = Self(truncatingIfNeeded: source)
            var shifts = srcByteCount &* 8
            while shifts < dstBitCount {
                self |= self << shifts
                shifts = shifts &* 2
            }
        } else if srcByteCount > dstByteCount {
            let shifts = (srcByteCount &- dstByteCount) &* 8
            let a = source >> shifts
            self = Self(truncatingIfNeeded: a)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
    }
}

func demonstrateStrangeWayOfSpeedingUpCode() {
    let num = 100_000_000
    var randomBytes = [UInt8](repeating: 0, count: num)
    for trial in 0 ..< 5 {

        print("#\(trial)")

        // Set new random bytes for each trial:
        randomBytes.withUnsafeMutableBytes { (ptr) -> Void in
            let rc = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(rc == errSecSuccess)
        }

        // ---- The Normal Test ----
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for e in randomBytes {
            let dst = UInt64(rangeConverted: e)
            checksum = checksum ^ dst
        }
        let t1 = CACurrentMediaTime()
        print("  Normal Test:", t1 - t0, "seconds (checksum: \(checksum))")

        // ---- The Faster Test ----
        func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe() {
            var checksum = UInt64(0)
            let t0 = CACurrentMediaTime()
            for e in randomBytes {
                let dst = UInt64(rangeConverted: e)
                checksum = checksum ^ dst
            }
            let t1 = CACurrentMediaTime()
            print("  Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
        }
        fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe()
    }
}

demonstrateStrangeWayOfSpeedingUpCode()

Running this on my MBP (I get the same result with latest dev snapshot 2018-02-16):

› swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.38 clang-902.0.30)
Target: x86_64-apple-darwin17.4.0
› swiftc -O -whole-module-optimization -gnone -static-stdlib test.swift 
› ./test
#&#8203;0
  Normal Test: 0.133281479007564 seconds (checksum: 18446744073709551615)
  Faster Test: 0.0559600760461763 seconds (checksum: 18446744073709551615)
#&#8203;1
  Normal Test: 0.127923870983068 seconds (checksum: 13093571283691877813)
  Faster Test: 0.0556459830258973 seconds (checksum: 13093571283691877813)
#&#8203;2
  Normal Test: 0.12858513899846 seconds (checksum: 17144620962624171501)
  Faster Test: 0.0576970489928499 seconds (checksum: 17144620962624171501)
#&#8203;3
  Normal Test: 0.133025638002437 seconds (checksum: 17723342345328784885)
  Faster Test: 0.0553884139517322 seconds (checksum: 17723342345328784885)
#&#8203;4
  Normal Test: 0.127924009982962 seconds (checksum: 1808504320951916825)
  Faster Test: 0.0553643790190108 seconds (checksum: 1808504320951916825)
@belkadan
Copy link
Contributor

cc @eeckstein

@jepers
Copy link
Author

jepers commented Mar 5, 2018

Issue remains in dev snapshot 2018-03-03.

@jepers
Copy link
Author

jepers commented May 27, 2018

Issue remains in dev snapshot 2018-05-26.

@jepers
Copy link
Author

jepers commented May 28, 2018

I noticed that if you remove the for-in loop for the 5 trials, then the normal test will become as fast as the fast test.

@gottesmm
Copy link
Member

I took a little look at this. There is nothing to do with the function being inside the actual function itself. If I pass in the array to the helper function and extract it out, it is as fast (baring a slight function call overhead).

func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
  var checksum = UInt64(0)
  let t0 = CACurrentMediaTime()
  for e in randomBytes {
    let dst = UInt64(rangeConverted: e)
    checksum = checksum ^ dst
  }
  let t1 = CACurrentMediaTime()
  print("  Faster Test2:", t1 - t0, "seconds (checksum: \(checksum))")
}

@gottesmm
Copy link
Member

I think for some reason the vectorizer is failing here. Inline we have:

+0x2f3      nopw                %cs:(%rax,%rax)
+0x300          movq                %rbx, %rsi
+0x303          movzbl              32(%r14,%rdx), %ebx
+0x309          movq                %rbx, %rdi
+0x30c          shlq                $8, %rdi
+0x310          orq                 %rbx, %rdi
+0x313          movq                %rdi, %rax
+0x316          shlq                $16, %rax
+0x31a          orq                 %rdi, %rax
+0x31d          movq                %rax, %rbx
+0x320          shlq                $32, %rbx
+0x324          orq                 %rax, %rbx
+0x327          xorq                %rsi, %rbx
+0x32a          cmpq                %rdx, %rcx
+0x32d          je                  "demonstrateStrangeWayOfSpeedingUpCode()+0x342"
+0x32f          incq                %rdx
+0x332          cmpq                %r8, %rdx
+0x335          jb                  "demonstrateStrangeWayOfSpeedingUpCode()+0x300"

In the function that is taken out:

+0x70       pmovzxbq            -2(%rdx), %xmm2
+0x76       pmovzxbq            (%rdx), %xmm3
+0x7b       movdqa              %xmm2, %xmm4
+0x7f       psllq               $8, %xmm4
+0x84       movdqa              %xmm3, %xmm5
+0x88       psllq               $8, %xmm5
+0x8d       por                 %xmm2, %xmm4
+0x91       por                 %xmm3, %xmm5
+0x95       movdqa              %xmm4, %xmm2
+0x99       psllq               $16, %xmm2
+0x9e       movdqa              %xmm5, %xmm3
+0xa2       psllq               $16, %xmm3
+0xa7       por                 %xmm4, %xmm2
+0xab       por                 %xmm5, %xmm3
+0xaf       movdqa              %xmm2, %xmm4
+0xb3       psllq               $32, %xmm4
+0xb8       movdqa              %xmm3, %xmm5
+0xbc       psllq               $32, %xmm5
+0xc1       por                 %xmm2, %xmm4
+0xc5       pxor                %xmm4, %xmm0
+0xc9       por                 %xmm3, %xmm5
+0xcd       pxor                %xmm5, %xmm1
+0xd1       addq                $4, %rdx
+0xd5       addq                $-4, %rsi

@gottesmm
Copy link
Member

The problem is that for some reason when the value is not removed in the other function we can not eliminate a bounds check in the loop. So the vectorization fails.

@gottesmm
Copy link
Member

On a completely different note, I found a really great way to match up cond_fail in assembly with SIL instructions. What you can do is compile with -g -Xfrontend -gsil and then load the program into lldb and diassembly with mixed source. So in this case I was able to see that the two ud2 in the function in question that were interesting was:

** 404   cond_fail %194 : $Builtin.Int1                  // id: %195
   405   %196 = builtin "truncOrBitCast_Int64_Word"(%190 : $Builtin.Int64) : $Builtin.Word // user: %197
   406   %197 = index_addr %180 : $*UInt8, %196 : $Builtin.Word // user: %198

    0x1000011ed <+493>: ud2

** 298   cond_fail %108 : $Builtin.Int1                  // id: %109

    0x1000011ef <+495>: ud2    
    0x1000011f1 <+497>: nopw   %cs:(%rax,%rax)

@jepers
Copy link
Author

jepers commented May 29, 2018

I'm assuming it might have something to with truncatingIfNeeded initializer, just like @atrick speculates about SR-7150 at the end of this comment:

The closest commit I can find in the past year is this, but it's just a wild guess:
commit a5ff35c
Author: Maxim Moiseev <moiseev@apple.com>
Date: Fri Jul 21 14:46:54 2017 -0700

[stdlib] extendingOrTruncating: => truncatingIfNeeded:

cc @moiseev

Note that even though the example code of these two issues (SR-7023 and SR-7150) is similar, they are doing different tests, they use different versions of the rangeConversion-init but for UInt8 to UInt64 they should do the same thing. Anyway, they are perhaps not demonstrating the same issue(s). But I thought the coincident of truncatingIfNeeded coming up in both was probably worth mentioning. Also, something about the following loop resulting in a surprising amount of code and branching was mentioned in the comments to SR-7150:

            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts << 1
            }

or here, in SR-7023:

            while shifts < dstBitCount {
                self |= self << shifts
                shifts = shifts &* 2
            }

@milseman
Copy link
Mannequin

milseman mannequin commented May 29, 2018

That's because of the

 self << shifts

You can instead use

self &<< shifts

to skip the branch. This can make a sizable difference:

public func shift(_ i: UInt, _ by: UInt) -> UInt {
  return i << by
}

public func shift_2(_ i: UInt, _ by: UInt) -> UInt {
  return i &<< by
}

; shift
         cmp        x1, #&#8203;0x40
         cset       w8, gt
         cset       w9, hi
         cmp        x1, #&#8203;0x0
         csel       w8, w9, w8, lt
         cmp        w8, #&#8203;0x1
         b.ne       loc_100007f88
         movz       x0, #&#8203;0x0
         ret
loc_100007f88:
         tbnz       x1, 0x3f, loc_100007fa0
         cmp        x1, #&#8203;0x40
         and        x8, x1, #&#8203;0x3f
         lsl        x8, x0, x8
         csel       x0, x8, xzr, lt
         ret
loc_100007fa0:
         brk        #&#8203;0x1
         brk        #&#8203;0x1

; shift_2
         and        x8, x1, #&#8203;0x3f
         lsl        x0, x0, x8
         ret

edit: code example

@jepers
Copy link
Author

jepers commented May 30, 2018

Yes, but note that if (in the unmodified example program above) we make only this change:

            while shifts < dstBitCount {
                self |= self &<< shifts // <-- Only change: replaced << with &<<
                shifts = shifts &* 2
            }

It makes both tests equally SLOW (not fast as expected).

@gottesmm
Copy link
Member

There is an additional bounds check from the actual loop itself. Consider the following attached swift that is fast in both cases.

@gottesmm
Copy link
Member

So it seems like we have two interesting bugs here where we are not hoisting bounds checks in two situations.

@jepers
Copy link
Author

jepers commented May 30, 2018

Ah, I tried a few different things with your attached test.backup.swift program, I noticed the following:

In my original example program, The Normal Test can become as fast as The Faster Test using only ONE modification to the program, and that is to replace the for-in loop of The Normal Test with a while loop (like you had done for both tests in your modified version of the program). So this is how The Normal Test will look after this single modification:

        // ---- The Normal Test ----
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime() // This line has to be before count = randomBytes.count
        let count = randomBytes.count
        var i = 0
        while i < count {
            let dst = UInt64(rangeConverted: randomBytes[i])
            checksum = checksum ^ dst
            i = i &+ 1
        }
        let t1 = CACurrentMediaTime()
        print("  Normal Test:", t1 - t0, "seconds (checksum: \(checksum))")

Again, note that this is the ONLY change that is needed in order to make both tests equally fast in the original program. So it's not necessary to replace << with &<< for example, and it's not necessary to replace the for-in loop of The Faster Test.

HOWEVER: Note that if we move the first call to CACurrentMediaTime() so that it happens after setting count to randomBytes.count, like this:

        // ---- The Normal Test ----
        var checksum = UInt64(0)
        let count = randomBytes.count
        let t0 = CACurrentMediaTime() // <-- Moved this here (so that it is after count).
        var i = 0
        while i < count {
            let dst = UInt64(rangeConverted: randomBytes[i])
            checksum = checksum ^ dst
            i = i &+ 1
        }
        let t1 = CACurrentMediaTime()
        print("  Normal Test:", t1 - t0, "seconds (checksum: \(checksum))")

then The Normal Test will be slow again!

This situation looks very similar to another issue which I reported some time ago: SR-6254.
Some of the underlying problems seem to be the same in these issues (and if so, the comments at the end of SR-6254 about final class vs struct might perhaps make sense here if Array's count is stored in a dynamically allocated buffer / reference type (together with the elements).

@jepers
Copy link
Author

jepers commented May 30, 2018

Btw, while playing around with the code, I noticed a compiler crash, reported as SR-7805, but it only happens when using a recent snapshot, and -O. Don't know if it might provide some clues.

@jepers
Copy link
Author

jepers commented May 30, 2018

Another observation (please let me know if my amateur investigation is just noise! You could be light years ahead for all I know). This is again a single modification to my original demo program, and this one is making both tests slow:

    init<S>(rangeConverted source: S)
        where S: FixedWidthInteger, S: UnsignedInteger
    {
        let srcByteCount = MemoryLayout<S>.size
        let dstByteCount = MemoryLayout<Self>.size
        if srcByteCount < dstByteCount {
            let dstBitCount = dstByteCount &* 8
            self = Self(truncatingIfNeeded: source)
            var shifts = srcByteCount &* 8
            while shifts < dstBitCount {
                self |= self << shifts
                shifts = shifts &* 2
            }
        } else {
            self = 0
            // As srcByteCount < dstByteCount is always true in this demo,
            // I've commented out the original else-statements below, and
            // replaced them with this one.
            // This change will make both Normal Test and Faster Test slow.
        }
        /*
        else if srcByteCount > dstByteCount {
            let shifts = (srcByteCount &- dstByteCount) &* 8
            let a = source >> shifts
            self = Self(truncatingIfNeeded: a)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
        */
    }

@jepers
Copy link
Author

jepers commented May 31, 2018

Linking to SR-6983, since replacing the for loop with a while loop works around that issue too.

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler The Swift compiler in itself improvement performance
Projects
None yet
Development

No branches or pull requests

3 participants