[SR-6254] Redundant load elimination + alias analysis opportunity. #48804

Open
jepers opened this issue Oct 30, 2017 · 11 comments
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself performance

Comments

@jepers

jepers commented Oct 30, 2017

Previous ID SR-6254
Radar rdar://problem/35301451
Original Reporter @jepers
Type Bug

Attachment: Download

Environment

Recent development snapshots (at least 2017-10-26 … 29), macOS 10.13.

Additional Detail from JIRA
Votes 0
Component/s Compiler
Labels Bug, Performance
Assignee None
Priority Medium

md5: c127c1be23378239d82633a19201e410

relates to:

  • SR-7023 Missed optimization opportunity, but not missed if wrapping code in a nested func

Issue Description:

EDIT 1: I have attached an updated version of the code so that it will compile without warnings with snapshot 2017-12-03 (and also added a // NOTE about the need to use @inline(__always) in Table.swift).

EDIT 2: Attached another version of the demonstration program updated to compile with dev snapshot 2018-03-03.

The file below is part of the program in the attachment (the updated one) and explains the issue:

main.swift

//
// This main.swift file is a demonstration of some IMHO strange
// ways to make the program faster / slower.
//
// NOTE: Don't try to compile this with anything earlier than
// DEVELOPMENT-SNAPSHOT-2017-10-29, since you'll probably just
// get a lot of errors not related to the issues meant
// to be demonstrated here.
//
// Please read the code and comments of this file to get an idea
// about what the conditional compilation flags do, and look at
// the following (which is when using my MBP, macOS 10.13):
//
// $ ./build.sh && ./Dlapg
// time: 8.04189954401227 seconds
//
// $ ./build.sh -D KNOWABLE_SIZE && ./Dlapg
// time: 8.15726826502942 seconds
//
// $ ./build.sh -D KNOWABLE_SIZE -D MAGIC && ./Dlapg
// time: 2.39418747503078 seconds
//
// $ ./build.sh -D KNOWABLE_SIZE -D MAGIC -D DISPEL_MAGIC && ./Dlapg
// time: 8.13506004196825 seconds
//
// I think I can understand why -D KNOWABLE_SIZE could make it faster,
// (although I am interested in if it really should in this case), but
// I do not at all understand why -D MAGIC is needed and why
// -D DISPEL_MAGIC dispels the magic.
// 
// Also see the NOTE in Table.swift about the need to use
// @inline(__always) for the two initializers there.
//

import QuartzCore // Only used for CACurrentMediaTime()


func createRandomDiffusionLimitedAggregationPatternImage() {
    // A pseudo random number generator seeded with value from SecRandomCopyBytes:
    let rg = Xoroshiro128Plus() 

    #if KNOWABLE_SIZE
        // Statically knowable image size, seems to make some optimizations possible BUT ...
        // ... only if compiling with -D MAGIC!
        let sz = Vector2(123, 234)
    #else
        // Statically unknowable image size, seems to make some optimizations impossible:
        let sz = (Vector2(120, 230) ..<& Vector2(130, 240)).randomPoint(using: rg)
    #endif

    let img = Table<FloatLinear2D>(size: sz) // Linear gamma grayscale raster image.
    #if DISPEL_MAGIC
        img[.zero].e0 = 0 // This magically dispels the magic of magicImgBounds.
    #endif
    #if MAGIC
        let magicImgBounds = img.bounds // See how and why this is used (at only one place) in the timed code below.
    #endif

    // Set all pixels to black:
    for c in img.coordinates { img[c].e0 = 0.0 }
    // Set a couple of pixels to white, to have something for the "corals" to grow on:
    for _ in 0 ..< 7 {
        img[img.bounds.randomPoint(using: rg)].e0 = 1.0 
    }
    // The following VectorRange is the 3x3 two-dim closed range (-1, -1) to (1, 1), but
    // since VectorRange always interprets its upperBound as exclusive we use (2, 2):
    let neighborhood = VectorRange(uncheckedBounds: (Vector2<Int>(-1, -1), Vector2<Int>(2, 2)))
    
    // Generate the coral pattern by random walking until hitting a white pixel, and also
    // measure and report the time it takes:
    let t0 = CACurrentMediaTime()
    for _ in 0 ..< 500_000 {
        var c = img.bounds.randomPoint(using: rg) // (magicImgBounds has no effect here.)
        for i in 0 ..< 20_000 {
            let dir = neighborhood.randomPoint(using: rg)
            c = (c &+ dir).periodicallyBounded(by: img.bounds) // (magicImgBounds has no effect here.)
            var sum = Float(0)
            for nc in neighborhood.points {
                #if MAGIC
                    let gc = (c &+ nc).periodicallyBounded(by: magicImgBounds)
                #else
                    let gc = (c &+ nc).periodicallyBounded(by: img.bounds)
                #endif
                sum = sum + img[gc].e0
            }
            if sum > 0.0 {
                if sum == 1.0 && i > 500 { img[c].e0 = 1.0 }
                break
            }
        }
    }
    let t1 = CACurrentMediaTime()
    print("time:", t1 - t0, "seconds")

    // Repeat the generated pattern as 3 x 3 tiles and save the result as a png file:    
    let img2 = type(of: img).init(size: img.size * 3)
    for c in img2.coordinates {
        img2[c] = img[c.periodicallyBounded(by: img.bounds)]
    }
    img2.saveAsPng(path: "pattern.png")
}

createRandomDiffusionLimitedAggregationPatternImage()

So the demonstration is meant to compile without errors and produce images like the one contained in the attached zip file. And the expected result (assuming these really are issues and not expected behaviors) is that the compiler should be able to optimize this program (so that time is 2 rather than 8) with -D KNOWABLE_SIZE but without needing -D MAGIC.

(I first encountered these issues within Xcode and I could've attached an Xcode project or Swift package but I think the risk of accidentally introducing unrelated issues might be reduced by only attaching the source files and a simple build.sh to compile them.)

@belkadan
Contributor

belkadan commented Nov 1, 2017

@swift-ci create

@atrick
Member

atrick commented Mar 7, 2018

First off, some comments on writing performance critical Swift code.

As a developer trying to understand the performance of your low-level
code, I suggest first using Instruments or your favorite Linux
profiler. Build with '-g' if the correlation between disassembly and
source isn't obvious. If you find yourself trying to understand the
disassembly in detail, a fancier disassembler like Hopper can help.

At the level of optimization that you want, a certain amount of
hand-optimization is to be expected. Most of the problems that you're
running into are the same problems that you would face with an
equivalent amount of abstraction in C++. The difference is that it's
still much harder to control the abstraction in Swift and give hints
to the optimizer. The good news is that it's not a fundamental
limitation of the Swift language and major changes are underway to
make high performance programming viable. The bad news is that it will
take a long time before enough support is in place to make Swift a fun
performance language--it's not just a matter of fixing a few optimizer
bugs.

In this case, manually constant propagating the image bounds into the
critical loop is hugely beneficial. Normally, the division inside
"unsafeRemainder" generates a regular integer divide:

idivq

When the divisor is constant, the LLVM code generator expands the
integer division into an integer multiply (imulq) and a couple of right
shifts. This is close to 10x faster!
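The difference can be sketched in isolation (the names below are illustrative, not taken from the attached project):

```swift
// Illustrative sketch of the periodic wrap the benchmark performs;
// the real implementation lives in VectorMath.swift in the attachment.

// Divisor is a runtime value: LLVM must emit an actual idivq.
func wrap(_ x: Int, modulo m: Int) -> Int {
    return ((x % m) + m) % m   // non-negative remainder
}

// Divisor is a literal: LLVM can strength-reduce the srem into an
// integer multiply and shifts instead of an idivq.
func wrap123(_ x: Int) -> Int {
    return ((x % 123) + 123) % 123
}

print(wrap(-1, modulo: 123))  // 122
print(wrap123(-1))            // 122
```

Both functions compute the same non-negative remainder; only the machine code the backend can emit for the division differs.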

Now, regarding how the Swift optimizer handles this benchmark... I
debugged it and fixed what I consider to be the first-order problem:
certain cases of closures inside generic protocol methods were not
being inlined: PR 15027 "closure inlining after witness
devirtualization".

To actually force full inlining, I also had to annotate a method in VectorMath.swift:

@inline(__always)
func unsafeRemainder(dividingBy other: Self) -> Self

Sadly, by itself this does not solve the bottleneck that you're seeing.

For reference, here's the LLVM code for the critical loop:

; <label>:386: ; preds = %406, %380
 %387 = phi i64 [ %384, %380 ], [ -1, %406 ]
 %388 = phi i64 [ %383, %380 ], [ %407, %406 ]
 %389 = add i64 %387, %378
 %390 = add i64 %388, %379
 %391 = srem i64 %389, 123
 %392 = srem i64 %390, 234
 %393 = add nsw i64 %391, 123
 %394 = add nsw i64 %392, 234
 %395 = srem i64 %393, 123
 %396 = srem i64 %394, 234
 %397 = load i64, i64* %.e0._value, align 8
 %398 = load i64, i64* %.e1, align 8
 %399 = mul i64 %395, %397
 %400 = mul i64 %396, %398
 %401 = add i64 %400, %399
 %402 = load i8*, i8** %._rawValue6, align 8
 %403 = getelementptr inbounds i8, i8* %402, i64 %401
 %.e065._value = bitcast i8* %403 to float*
 %404 = load float, float* %.e065._value, align 4
 %405 = fadd float %381, %404
 br label %380

; <label>:406: ; preds = %380
 %407 = add i64 %383, 1
 %408 = icmp slt i64 %407, 2
 br i1 %408, label %386, label %409

The only difference in the code with -D MAGIC is the constant operands (123, 234) to the 'srem' instructions:

 %395 = srem i64 %393, 123
 %396 = srem i64 %394, 234

The relevant SIL instructions are:

 %literal = integer_literal $Builtin.Int64, 123, loc "./main.swift":50:26, scope 4
 %bounds = struct $Int (%literal : $Builtin.Int64), loc "./main.swift":56:34, scope 4

%e0_adr = struct_element_addr %114 : $*Vector2<Int>, #Vector2.e0, loc "./Table/Table.swift":26:66, scope 268

%e0_val_adr = struct_element_addr %179 : $*Int, #Int._value, loc "./Table/Table.swift":26:66, scope 268

%e0_val = load %e0_val_adr : $*Builtin.Int64, loc "./Table/Table.swift":26:66, scope 355
 %e0 = struct $Int (%e0_val : $Builtin.Int64), loc "./Table/Table.swift":26:61, scope 355

...
 store %e0 to %division_arg : $*Int

With -D MAGIC, the redundant load at `%e0_val` is manually bypassed at the source level so that we now directly store the literal in `%bounds` before calling the division builtin.

This is the only significant difference in the optimized SIL:

store %bounds to %division_arg : $*Int

Although it is very difficult for the optimizer to remove loads with
arbitrary calls and writes in between, I think it's worth leaving this
bug open as an opportunity for better alias analysis and redundant
load elimination.

Issue #1: Redundant load elimination during `Table` initialization.

The "-D KNOWABLE_SIZE -D MAGIC -D DISPEL_MAGIC" case is the first to fix. It involves a short code sequence:

let img = Table<FloatLinear2D>(size: sz) // Linear gamma grayscale raster image.
#if DISPEL_MAGIC
img[.zero].e0 = 0 // This magically dispels the magic of magicImgBounds.
#endif
#if MAGIC
let magicImgBounds = img.bounds // See how and why this is used (at only one place) in the timed code below.
#endif

I think that stricter alias analysis will be instrumental, and we do have a plan for strengthening alias analysis. Part of the problem here is likely that `img` is a class type so we're loading and storing different elements in the heap.

Issue #2: Redundant load elimination in the critical loop.

With "-D KNOWABLE_SIZE" (no MAGIC), the optimizer fails to remove
redundant loads in a tight loop that does nothing but call Builtin
arithmetic operations.

 for nc in neighborhood.points {

But there is a lot of code between Table initialization and the critical loop. The optimizer needs to be able to evaluate side effects and aliasing.

I think this will minimally involve:

(a) knowing that `img` cannot escape (which isn't realistic in real code)

(b) analyzing all accesses and calls that involve `img` to prove that size cannot be modified.

Note that, no matter how well the optimizer does on this particular benchmark, the performance of this code will always be very fragile unless the image bounds are actually an internal global constant that is never modified!
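A minimal sketch of that last point, assuming the extents can be fixed at the module level (the names `imageWidth`/`imageHeight` are hypothetical, not from the attachment):

```swift
// Hypothetical sketch: the image extents as module-level `let`
// constants initialized from literals. Every use of the divisor is
// then a compile-time constant, so the backend can strength-reduce
// the division regardless of what the program stores in between.
let imageWidth = 123
let imageHeight = 234

func wrapped(_ x: Int, _ y: Int) -> (Int, Int) {
    // Non-negative periodic wrap; both divisors are literals here.
    return (((x % imageWidth) + imageWidth) % imageWidth,
            ((y % imageHeight) + imageHeight) % imageHeight)
}

print(wrapped(-1, 234))  // (122, 0)
```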

@jepers
Copy link
Author

jepers commented Mar 7, 2018

Thank you for taking the time to write such an informative answer, it is much appreciated!

@atrick
Member

atrick commented Mar 7, 2018

If you want to guarantee that your critical loop doesn't perform integer division I can think of a couple ways to do that without relying on the optimizer. Not sure if either is practical:

  • Process an image in fixed size blocks, or fixed size ranges and use a global let to hold that size.

  • Process ranges in power-of-two sizes and manually shift right rather than calling the unsafeRemainder api.
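The power-of-two option can be sketched as follows (`tileSize` is a hypothetical name; the mask trick relies on the extent being a power of two and on two's-complement integers, so it also handles negative coordinates):

```swift
// Hypothetical sketch: when the extent is a power of two, a periodic
// wrap needs no division at all -- a bitwise AND with (size - 1)
// yields the non-negative remainder, even for negative inputs.
let tileSize = 256              // must be a power of two

func wrapPow2(_ x: Int) -> Int {
    return x & (tileSize - 1)
}

print(wrapPow2(-1))   // 255
print(wrapPow2(256))  // 0
```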

@jepers
Author

jepers commented Mar 7, 2018

Yes, I am aware of options like that. In the original project, the Vector and Table types are part of a little (internal) library for general data processing. The goal is to make it as fast as possible while at the same time being as general / convenient to use as possible.

The purpose of the little library is to make it quick and easy to develop algorithms in an experimental way, testing out different numbers of dimensions without having to rewrite a lot of boilerplate, etc. If an algorithm needs to be faster than what is possible with this little library, we'll write a hard-coded version of it, perhaps using Metal, Accelerate or simd, etc.

But the library needs to be optimized anyway, because I've concluded that if I'm not actively working to keep every part of it as fast as possible (while still being sufficiently general/convenient to use), it quickly falls apart efficiency-wise and becomes so slow that it's useless even for rapid prototyping and experimentation, i.e. I don't want to wait an hour for a result that could take a second if I had cared about optimization.

Note that I'm actually - sometimes - very impressed/satisfied with the optimizer. For example: the Table type is quite a high-level abstraction. It has N-dimensional Vector indices, the Vector types being relatively high-level abstractions themselves, with generic Element and type-level Count, etc. And still - at least in some contexts - quite complex and generic code that builds upon it will be optimized into code that is as fast as a corresponding hard-coded and optimized C version.

Back to the particular issues in this demonstration program:

I assumed that if I created a Table instance with a statically knowable size, the optimizer would (inline small methods according to some heuristic and) make use of the fact that bounds is a constant for this instance.

That is, looking at the code below, I expected the optimizer to see that bounds is a "constant" computed property since it only depends on constants, namely .zero and size, and .zero is always a statically knowable constant value, and for a Table instance that has been initialized with a size given by integer literals, size is also statically knowable:

final class Table<Data: TableData> {
    // ...

    /// The N-dimensional size (number of elements in each direction).
    let size: Data.Coordinate

    /// The N-dimensional byte-stride (number of bytes between
    /// elements in a row, rows in a plane, planes in a …)
    let stride: Data.Coordinate
    
    var bounds: VectorRange<Data.Coordinate> { return .zero ..<& size }
    var coordinates: VectorRangePoints<Data.Coordinate> {
        return bounds.unsafePoints
    }
    // ...
}    

It turned out that my expectations were partly met: a static size did indeed result in faster execution times, but (somewhat surprisingly) only if I manually propagated the static bounds constant. And (even more surprisingly) it also turned out to be very easy to accidentally ruin even my manual propagation with something seemingly unrelated, like writing to the data of the table. I can't see how that would change the fact that magicBounds is a let constant - and, for that matter, the fact that Table's size is a statically knowable let constant and bounds is a "constant computed property", i.e. computed only from constants.

As I understood your reply, expectations like the ones I just described might possibly be fully met by a future version of the optimizer, but for now, I need to hold its hand quite a bit in some situations, and try to learn and keep up with its sometimes strange and changing ways.

I still don't quite understand how writing to the Table instance's (img's pixel) data can make the optimizer forget that size and bounds are constant. Will read your explanation again.

@atrick
Member

atrick commented Mar 7, 2018

Regarding the optimizer's failure to recognize size as a constant... I mentioned redundant load elimination and alias analysis. Those are aggressive attempts to prove that a value doesn't change within a region of code, and fairly fragile optimizations. I think it's possible in your case, but the fact that size is stored in a class makes it a little more difficult. Class storage is inherently mutable, so the analysis needs to prove that the class reference hasn't escaped. Now, in your case, knowledge that the property is a `let` means that a lot of this analysis could be skipped. It seems obvious on the surface, but making all the optimizer pieces do the right thing for `lets` is nontrivial because none of those pieces are aware of when `let` initialization is taking place.

There's a more direct, less fragile approach to solving this problem called LetPropertiesOpt. Currently, that only works for class properties that are statically initialized to a constant (they may as well be static properties). All occurrences are replaced by that constant. I think it would be reasonable to extend that optimization with a flow-sensitive analysis that can handle objects that are initialized with a constant and accessed locally. That's also somewhat fragile because object creation needs to be visible from all of its accesses, but that seems to be one of the main things that you want.
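For illustration, here is a hedged sketch of the two shapes described above (the class names are hypothetical):

```swift
// The shape LetPropertiesOpt handles today: every initializer assigns
// the same literal to the `let`, so uses of `size` can be replaced by
// the constant as if it were a static property.
final class FixedTable {
    let size: Int
    init() { self.size = 123 }
}

// The case discussed above: the constant flows in through a parameter
// at a locally visible creation site, which would need the proposed
// flow-sensitive analysis to prove constant.
final class DynamicTable {
    let size: Int
    init(size: Int) { self.size = size }
}

let t = DynamicTable(size: 123) // creation visible from all accesses
print(t.size % 123)  // 0; in principle foldable to a constant
```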

@jepers
Author

jepers commented Mar 8, 2018

Thanks again for taking the time to explain these things!

As a general workaround (until the optimizer gets better), do you think it would be better to write a type like Table as a struct (instead of a final class), but still with reference semantics? That is:

// Storage would still be a class:
final class TableStorage {
    let data: UnsafeMutableRawPointer
    // ...
}

// But Table would be a struct with ref semantics:
struct Table<Data: TableData> {
    let storage: TableStorage // This is still a possibly shared storage.

    /// A pointer to the first byte of the first element.
    let baseAddress: UnsafeMutableRawPointer

    /// The N-dimensional size (number of elements in each direction).
    let size: Data.Coordinate

    /// The N-dimensional byte-stride (number of bytes between
    /// elements in a row, rows in a plane, planes in a …)
    let stride: Data.Coordinate
    
    var bounds: VectorRange<Data.Coordinate> { return .zero ..<& size }
    var coordinates: VectorRangePoints<Data.Coordinate> {
        return bounds.unsafePoints
    }
    // ...
}

A Table is (for our purposes) a resource with identity, and it wouldn't make sense to have it be a value type with copy-on-write, at least not in most cases, so it felt like a natural fit for a class type. Writing it as a struct with reference semantics is kind of unorthodox, I guess, but I'm not sure I can see any real downsides if it would mean that the kind of optimizations we're talking about here would be more likely / predictable?

@atrick
Member

atrick commented Mar 8, 2018

I would say yes, it's better to use a struct, but if you ever pass that Table<Data> around, or store it, as an existential (protocol type), then it might result in extra heap allocation. The existential buffer is just three words, and you have at least four words or much more depending on the number of dimensions.

So, I would say it's worth a try but ultimately depends on how the Table is being used.

By the way, we should eventually expose an API for tail-allocated class storage so you don't need the double indirection to get at the table data.
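For reference, the standard library's `ManagedBuffer` already offers one form of tail-allocated class storage; whether it fits Table's needs is a separate question, but a minimal sketch looks like this:

```swift
// Sketch using the standard library's ManagedBuffer: the header and
// the element buffer live in a single heap object, so reaching the
// element data needs no second indirection through a separate
// allocation. The header here is just the element count.
let count = 16
let buf = ManagedBuffer<Int, Float>.create(minimumCapacity: count) { _ in count }
buf.withUnsafeMutablePointerToElements { p in
    p.initialize(repeating: 0, count: count)
    p[3] = 1.0
}
let value = buf.withUnsafeMutablePointerToElements { $0[3] }
print(value)  // 1.0
```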

@jepers
Author

jepers commented Mar 9, 2018

Great, thanks!

@atrick
Member

atrick commented Mar 9, 2018

To be clear, your original intuition is generally correct. Make it a class if it has class-like semantics. I was only speaking to the possibility of optimizing away the `size` property as a compile-time constant. In certain circumstances, the optimizer may have an easier time "seeing through" the struct initialization.

@jepers
Author

jepers commented Mar 9, 2018

Yes, understood.
(btw, just in case you'd be interested in having a look at my latest stumbling block: https://bugs.swift.org/browse/SR-7150 )

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022