[SR-7023] Missed optimization opportunity, but not missed if wrapping code in a nested func #49571
Comments
cc @eeckstein |
Issue remains in dev snapshot 2018-03-03. |
Issue remains in dev snapshot 2018-05-26. |
I noticed that if you remove the for-in loop for the 5 trials, then the normal test will become as fast as the fast test. |
I took a little look at this. It has nothing to do with the function being nested inside the enclosing function itself: if I pass the array to a helper function and extract it there, it is just as fast (barring a slight function-call overhead).

func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = UInt64(rangeConverted: e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print(" Faster Test2:", t1 - t0, "seconds (checksum: \(checksum))")
} |
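For reference, the UInt64(rangeConverted:) initializer used above is not included in this export. Based on the shl/or pattern in the disassembly below (the byte replicated with shifts of 8, 16, and 32) and the truncatingIfNeeded discussion later in the thread, a plausible sketch for the UInt8-to-UInt64 case is:

```swift
// Hypothetical reconstruction (not the original source): convert a
// UInt8 spanning 0...255 to a UInt64 spanning 0...UInt64.max by
// replicating the byte into all eight byte positions, matching the
// shl/or pattern in the disassembly below.
extension UInt64 {
    init(rangeConverted v: UInt8) {
        var x = UInt64(truncatingIfNeeded: v)
        x |= x << 8   // 0x00AB     -> 0xABAB
        x |= x << 16  // 0xABAB     -> 0xABABABAB
        x |= x << 32  // 0xABABABAB -> 0xABABABABABABABAB
        self = x
    }
}
```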
I think for some reason the vectorizer is failing here. Inline we have:

+0x2f3 nopw %cs:(%rax,%rax)
+0x300 movq %rbx, %rsi
+0x303 movzbl 32(%r14,%rdx), %ebx
+0x309 movq %rbx, %rdi
+0x30c shlq $8, %rdi
+0x310 orq %rbx, %rdi
+0x313 movq %rdi, %rax
+0x316 shlq $16, %rax
+0x31a orq %rdi, %rax
+0x31d movq %rax, %rbx
+0x320 shlq $32, %rbx
+0x324 orq %rax, %rbx
+0x327 xorq %rsi, %rbx
+0x32a cmpq %rdx, %rcx
+0x32d je "demonstrateStrangeWayOfSpeedingUpCode()+0x342"
+0x32f incq %rdx
+0x332 cmpq %r8, %rdx
+0x335 jb "demonstrateStrangeWayOfSpeedingUpCode()+0x300"

In the function that is taken out:

+0x70 pmovzxbq -2(%rdx), %xmm2
+0x76 pmovzxbq (%rdx), %xmm3
+0x7b movdqa %xmm2, %xmm4
+0x7f psllq $8, %xmm4
+0x84 movdqa %xmm3, %xmm5
+0x88 psllq $8, %xmm5
+0x8d por %xmm2, %xmm4
+0x91 por %xmm3, %xmm5
+0x95 movdqa %xmm4, %xmm2
+0x99 psllq $16, %xmm2
+0x9e movdqa %xmm5, %xmm3
+0xa2 psllq $16, %xmm3
+0xa7 por %xmm4, %xmm2
+0xab por %xmm5, %xmm3
+0xaf movdqa %xmm2, %xmm4
+0xb3 psllq $32, %xmm4
+0xb8 movdqa %xmm3, %xmm5
+0xbc psllq $32, %xmm5
+0xc1 por %xmm2, %xmm4
+0xc5 pxor %xmm4, %xmm0
+0xc9 por %xmm3, %xmm5
+0xcd pxor %xmm5, %xmm1
+0xd1 addq $4, %rdx
+0xd5 addq $-4, %rsi |
The problem is that, for some reason, when the loop is not extracted into the other function we cannot eliminate a bounds check in the loop, so the vectorization fails. |
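To illustrate the bounds-check obstacle (this workaround is my own sketch, not something proposed in the thread): iterating through an unsafe buffer pointer removes the per-element bounds check, which is the kind of obstacle to vectorization described above.

```swift
// Sketch (assumption, not from the thread): UnsafeBufferPointer
// iteration performs no per-element bounds check. The rangeConverted
// initializer is replaced by a plain truncating conversion here to
// keep the example self-contained.
func checksumUnsafe(_ randomBytes: [UInt8]) -> UInt64 {
    var checksum = UInt64(0)
    randomBytes.withUnsafeBufferPointer { buf in
        for e in buf {
            checksum ^= UInt64(truncatingIfNeeded: e)
        }
    }
    return checksum
}
```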
On a completely different note, I found a really great way to match up cond_fail in assembly with SIL instructions: compile with -g -Xfrontend -gsil, then load the program into lldb and disassemble with mixed source. In this case I was able to see that the two interesting ud2 instructions in the function in question were:

** 404 cond_fail %194 : $Builtin.Int1 // id: %195
405 %196 = builtin "truncOrBitCast_Int64_Word"(%190 : $Builtin.Int64) : $Builtin.Word // user: %197
406 %197 = index_addr %180 : $*UInt8, %196 : $Builtin.Word // user: %198
0x1000011ed <+493>: ud2
** 298 cond_fail %108 : $Builtin.Int1 // id: %109
0x1000011ef <+495>: ud2
0x1000011f1 <+497>: nopw %cs:(%rax,%rax) |
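For the record, the workflow described above might look roughly like this (the source file name main.swift is a placeholder; the function name comes from the disassembly earlier in the thread):

```
$ swiftc -O -g -Xfrontend -gsil main.swift -o main
$ lldb ./main
(lldb) breakpoint set --name demonstrateStrangeWayOfSpeedingUpCode
(lldb) run
(lldb) disassemble --mixed
```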
I'm assuming it might have something to do with the truncatingIfNeeded initializer, just like @atrick speculates about SR-7150 at the end of this comment:
cc @moiseev Note that even though the example code of these two issues (SR-7023 and SR-7150) is similar, they perform different tests and use different versions of the rangeConversion init, though for UInt8 to UInt64 they should do the same thing. Anyway, they are perhaps not demonstrating the same issue(s). But I thought the coincidence of truncatingIfNeeded coming up in both was probably worth mentioning. Also, the comments to SR-7150 mention that the following loop results in a surprising amount of code and branching:
or here, in SR-7023:
|
That's because of the extra branch that the ordinary shift operator << performs to handle out-of-range shift amounts. You can instead use the masking shift operator &<< to skip the branch. This can make a sizable difference.
edit: code example |
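The code example referred to here was not preserved in this export. Judging from the later comment about replacing << with &<<, the suggestion appears to have been Swift's masking shift; a minimal sketch:

```swift
// Sketch assuming the suggestion was the masking shift operator &<<,
// which masks the shift amount instead of branching on it.
let byte: UInt8 = 0xAB
var x = UInt64(truncatingIfNeeded: byte)
x |= x &<< 8   // no branch for out-of-range shift amounts
x |= x &<< 16
x |= x &<< 32
```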
Yes, but note that if (in the unmodified example program above) we make only this change, it makes both tests equally SLOW (not fast as expected). |
There is an additional bounds check from the actual loop itself. Consider the attached Swift program, which is fast in both cases. |
So it seems like we have two interesting bugs here where we are not hoisting bounds checks in two situations. |
Ah, I tried a few different things with your attached test.backup.swift program and noticed the following: in my original example program, The Normal Test can become as fast as The Faster Test with only ONE modification, namely replacing the for-in loop of The Normal Test with a while loop (as you had done for both tests in your modified version of the program). So this is how The Normal Test looks after this single modification:
Again, note that this is the ONLY change needed to make both tests equally fast in the original program. So it's not necessary to replace << with &<<, for example, and it's not necessary to replace the for-in loop of The Faster Test. HOWEVER: note that if we move the first call to CACurrentMediaTime() so that it happens after setting count to randomBytes.count, like this:
then The Normal Test becomes slow again! This situation looks very similar to another issue which I reported some time ago: SR-6254. |
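The single modification described above (for-in replaced by a while loop) might be sketched like this, with a simplified body: truncatingIfNeeded stands in for the rangeConverted initializer, and the timing code is omitted.

```swift
// Sketch of "The Normal Test" after replacing its for-in loop with a
// while loop, as described above. The body is simplified: a plain
// truncating conversion instead of UInt64(rangeConverted:), no timing.
func normalTestWhileLoop(_ randomBytes: [UInt8]) -> UInt64 {
    var checksum = UInt64(0)
    var i = 0
    let count = randomBytes.count
    while i < count {
        checksum ^= UInt64(truncatingIfNeeded: randomBytes[i])
        i += 1
    }
    return checksum
}
```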
Btw, while playing around with the code, I noticed a compiler crash, reported as SR-7805, but it only happens when using a recent snapshot, and -O. Don't know if it might provide some clues. |
Another observation (please let me know if my amateur investigation is just noise! You could be light years ahead for all I know). This is again a single modification to my original demo program, and this one makes both tests slow:
|
Linking to SR-6983, since replacing the for loop with a while loop works around that issue too. |
Attachment: Download
Additional Detail from JIRA
md5: d1719415de47fefc0d5b1f41eec99e68
relates to:
Issue Description:
As discussed in more detail here, the following code demonstrates that the execution speed of some code increases when simply wrapping it in a nested function, suggesting a missed optimization in the natural (not-wrapped-in-nested-func) "Normal Test" below.
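The example program itself is not included in this export. A rough reconstruction of its shape from the surrounding comments (names and loop bodies simplified, truncatingIfNeeded standing in for the rangeConverted initializer, timing and trials omitted) would be:

```swift
// Rough reconstruction (assumption): the same loop run two ways.
func runTests(_ randomBytes: [UInt8]) -> (normal: UInt64, fast: UInt64) {
    // "Normal Test": the loop written directly in the enclosing function.
    var normal = UInt64(0)
    for e in randomBytes {
        normal ^= UInt64(truncatingIfNeeded: e)
    }
    // "Faster Test": the identical loop wrapped in a nested func,
    // which (per this report) is unexpectedly faster.
    func fasterTest() -> UInt64 {
        var checksum = UInt64(0)
        for e in randomBytes {
            checksum ^= UInt64(truncatingIfNeeded: e)
        }
        return checksum
    }
    return (normal, fasterTest())
}
```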
Running this on my MBP (I get the same result with latest dev snapshot 2018-02-16):