[SR-6254] Redundant load elimination + alias analysis opportunity. #48804
Comments
@swift-ci create |
First off, some comments on writing performance-critical Swift code:

- As a developer trying to understand the performance of your low-level …
- At the level of optimization that you want, a certain amount of …
- In this case, manually constant propagating the image bounds into the idivq …
- When the divisor is constant, the LLVM code generator expands the …

Now, regarding how the Swift optimizer handles this benchmark... I … To actually force full inlining, I also had to annotate a method in VectorMath.swift:

```swift
@inline(__always)
func unsafeRemainder(dividingBy other: Self) -> Self
```

Sadly, by itself this does not solve the bottleneck that you're seeing. For reference, here's the LLVM code for the critical loop:

```llvm
; <label>:386:                                    ; preds = %406, %380
  %387 = phi i64 [ %384, %380 ], [ -1, %406 ]
  %388 = phi i64 [ %383, %380 ], [ %407, %406 ]
  %389 = add i64 %387, %378
  %390 = add i64 %388, %379
  %391 = srem i64 %389, 123
  %392 = srem i64 %390, 234
  %393 = add nsw i64 %391, 123
  %394 = add nsw i64 %392, 234
  %395 = srem i64 %393, 123
  %396 = srem i64 %394, 234
  %397 = load i64, i64* %.e0._value, align 8
  %398 = load i64, i64* %.e1, align 8
  %399 = mul i64 %395, %397
  %400 = mul i64 %396, %398
  %401 = add i64 %400, %399
  %402 = load i8*, i8** %._rawValue6, align 8
  %403 = getelementptr inbounds i8, i8* %402, i64 %401
  %.e065._value = bitcast i8* %403 to float*
  %404 = load float, float* %.e065._value, align 4
  %405 = fadd float %381, %404
  br label %380

; <label>:406:                                    ; preds = %380
  %407 = add i64 %383, 1
  %408 = icmp slt i64 %407, 2
  br i1 %408, label %386, label %409
```

The only difference in the code with -D MAGIC is the constant operands (123, 234) to the 'srem' instructions:

```llvm
%395 = srem i64 %393, 123
%396 = srem i64 %394, 234
```

The relevant SIL instructions are:

```
%literal = integer_literal $Builtin.Int64, 123, loc "./main.swift":50:26, scope 4
%bounds = struct $Int (%literal : $Builtin.Int64), loc "./main.swift":56:34, scope 4
%e0_adr = struct_element_addr %114 : $*Vector2<Int>, #Vector2.e0, loc "./Table/Table.swift":26:66, scope 268
%e0_val_adr = struct_element_addr %179 : $*Int, #Int._value, loc "./Table/Table.swift":26:66, scope 268
%e0_val = load %e0_val_adr : $*Builtin.Int64, loc "./Table/Table.swift":26:66, scope 355
%e0 = struct $Int (%e0_val : $Builtin.Int64), loc "./Table/Table.swift":26:61, scope 355
...
store %e0 to %division_arg : $*Int
```

With -D MAGIC, the redundant load at `%e0_val` is manually bypassed at the source level so that we now directly store the literal in `%bounds` before calling the division builtin. This is the only significant difference in the optimized SIL:

```
store %bounds to %division_arg : $*Int
```

Although it is very difficult for the optimizer to remove loads with …

Issue #1: Redundant load elimination during `Table` initialization.

The "-D KNOWABLE_SIZE -D MAGIC -D DISPEL_MAGIC" case is the first to fix. It involves a short code sequence:

```swift
let img = Table<FloatLinear2D>(size: sz) // Linear gamma grayscale raster image.
#if DISPEL_MAGIC
img[.zero].e0 = 0 // This magically dispels the magic of magicImgBounds.
#endif
#if MAGIC
let magicImgBounds = img.bounds // See how and why this is used (at only one place) in the timed code below.
#endif
```

I think that stricter alias analysis will be instrumental, and we do have a plan for strengthening alias analysis. Part of the problem here is likely that `img` is a class type, so we're loading and storing different elements in the heap.

Issue #2: Redundant load elimination in the critical loop.

With "-D KNOWABLE_SIZE" (no MAGIC), the optimizer fails to remove …

```swift
for nc in neighborhood.points {
```

But there is a lot of code between Table initialization and the critical loop. The optimizer needs to be able to evaluate side effects and aliasing. I think this will minimally involve:

(a) knowing that …
(b) analyzing all accesses and calls that involve `img` to prove that …

Note that, no matter how well the optimizer does on this particular benchmark, the performance of this code will always be very fragile unless the image bounds is actually an internal global constant that is never modified! |
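To make the MAGIC-style workaround above concrete, here is a minimal sketch (the type and function names are mine, not the project's) of manually propagating literal bounds so that the remainder operations in the hot path use constant divisors, which the backend can lower to multiply/shift sequences instead of idivq:

```swift
// Illustrative stand-in for the project's Vector2<Int>.
struct Vec2 { var e0: Int; var e1: Int }

@inline(__always)
func wrapped(_ p: Vec2, within bounds: Vec2) -> Vec2 {
    // ((x % n) + n) % n yields a non-negative remainder even for negative x,
    // matching the srem/add/srem sequence in the LLVM listing above.
    Vec2(e0: ((p.e0 % bounds.e0) + bounds.e0) % bounds.e0,
         e1: ((p.e1 % bounds.e1) + bounds.e1) % bounds.e1)
}

// With the literals visible at the call site, the divisors are constants.
let magicImgBounds = Vec2(e0: 123, e1: 234)
let q = wrapped(Vec2(e0: -1, e1: 235), within: magicImgBounds)
print(q.e0, q.e1) // 122 1
```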
Thank you for taking the time to write such an informative answer, it is much appreciated! |
If you want to guarantee that your critical loop doesn't perform integer division, I can think of a couple of ways to do that without relying on the optimizer. Not sure if either is practical:
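One possible approach along these lines (my own sketch, not necessarily what was originally suggested here): if each neighbor offset is known to stay within one period of the bounds, the wrap-around can be done with compares and adds, with no integer division at all:

```swift
@inline(__always)
func wrapOnce(_ x: Int, modulus n: Int) -> Int {
    // Assumed precondition: -n <= x < 2 * n, i.e. x is at most one
    // period outside [0, n), as neighbor offsets typically are.
    var r = x
    if r < 0 { r += n }
    if r >= n { r -= n }
    return r
}

print(wrapOnce(-1, modulus: 123))  // 122
print(wrapOnce(124, modulus: 123)) // 1
```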
|
Yes, I am aware of options like that. In the original project, the Vector and Table types are part of a little (internal) library for general data processing. The goal is to make it as fast as possible while at the same time being as general and convenient to use as possible. The purpose of the little library is to make it quick and easy to develop algorithms in an experimental way, testing out different numbers of dimensions without having to rewrite a lot of boilerplate, etc.

If an algorithm needs to be faster than what is possible with this little library, we'll write a hard-coded version of it, perhaps using Metal, Accelerate or simd. But the library needs to be optimized anyway, because I've concluded that if I'm not actively working to get every part of it as fast as possible (while still being sufficiently general and convenient to use), it quickly falls apart efficiency-wise and becomes so slow that it's useless even for rapid prototyping and experimentation, i.e. I don't want to wait an hour for a result that could take a second if I had cared about optimization.

Note that I'm actually, sometimes, very impressed with the optimizer. For example: the Table type is quite a high-level abstraction. It has N-dimensional Vector indices, the Vector types being relatively high-level abstractions themselves, with generic Element and type-level Count, etc. And still, at least in some contexts, quite complex and generic code that builds upon it will be optimized into code that is as fast as a corresponding hard-coded and optimized C version.

Back to the particular issues in this demonstration program: I assumed that if I created a Table instance with a statically knowable size, the optimizer would (inline small methods according to some heuristic and) make use of the fact that bounds is a constant for this instance.
That is, looking at the code below, I expected the optimizer to see that bounds is a "constant" computed property since it only depends on constants, namely .zero and size, and .zero is always a statically knowable constant value, and for a Table instance that has been initialized with a size given by integer literals, size is also statically knowable:
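A hypothetical reconstruction of the shape being described (the real Table and Vector code is in the attachment; the names below are purely illustrative):

```swift
struct V2 {
    let e0: Int, e1: Int
    static let zero = V2(e0: 0, e1: 0)  // always statically knowable
}

final class MiniTable {
    let size: V2                            // set once, here from literals
    var bounds: (min: V2, max: V2) {        // "constant" computed property:
        (V2.zero, size)                     // depends only on let constants
    }
    init(size: V2) { self.size = size }
}

let img = MiniTable(size: V2(e0: 123, e1: 234))
// The expectation: the optimizer could fold img.bounds to ((0, 0), (123, 234)).
print(img.bounds.max.e0, img.bounds.max.e1) // 123 234
```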
It turned out that my expectations were partly met: a static size did indeed result in faster execution times, but (somewhat surprisingly) only if I manually propagated the static bounds constant, and (even more surprisingly) it also turned out to be very easy to accidentally ruin even my manual propagation through something seemingly unrelated, like writing to the data of the table. I can't see how that would change the fact that magicBounds is a let constant, or, for that matter, the fact that Table's size is a statically knowable let constant and bounds is a "constant computed property", i.e. computed only from constants.

As I understood your reply, expectations like the ones I just described might possibly be fully met by a future version of the optimizer, but for now, I need to hold its hand quite a bit in some situations, and try to learn and keep up with its sometimes strange and changing ways. I still don't quite understand how writing to the Table instance's (img's pixel) data can make the optimizer forget that size and bounds are constant. Will read your explanation again. |
Regarding the optimizer's failure to recognize … There's a more direct, less fragile approach to solving this problem, called LetPropertiesOpt. Currently, that only works for class properties that are statically initialized to a constant (they may as well be static properties). All occurrences are replaced by that constant. I think it would be reasonable to extend that optimization with a flow-sensitive analysis that can handle objects that are initialized with a constant and accessed locally. That's also somewhat fragile, because object creation needs to be visible from all of its accesses, but that seems to be one of the main things that you want. |
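A sketch of the pattern in question, as I read the description above (the class here is illustrative, not from the project): a `let` class property that every initializer assigns the same compile-time constant can have all of its reads replaced by that constant, while an initializer-dependent one would need the proposed flow-sensitive extension:

```swift
final class Grid {
    let width = 640            // statically initialized constant:
                               // the kind LetPropertiesOpt handles today
    let height: Int            // depends on an initializer argument:
    init(height: Int) {        // would need flow-sensitive analysis when
        self.height = height   // the object is created and used locally
    }
}

let g = Grid(height: 480)
print(g.width)   // every use of g.width could be folded to 640
print(g.height)  // only foldable if the creation site is visible
```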
Thanks again for taking the time to explain these things! As a general workaround (until the optimizer gets better), do you think it would be better to write a type like Table as a struct (instead of a final class), but still with reference semantics? That is:
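A minimal sketch of what that could look like (an assumed shape, not the project's actual code): value-type header fields such as `size` live inline in the struct, while the element storage stays behind a single shared class reference, giving reference semantics without copy-on-write:

```swift
final class Storage {
    var data: [Float]
    init(count: Int) { data = Array(repeating: 0, count: count) }
}

struct TableRef {
    let size: Int                  // inline: easier for the optimizer to see
    private let storage: Storage   // shared heap storage: reference semantics
    init(size: Int) {
        self.size = size
        storage = Storage(count: size)
    }
    subscript(i: Int) -> Float {
        get { storage.data[i] }
        nonmutating set { storage.data[i] = newValue } // writes are shared
    }
}

let a = TableRef(size: 4)
let b = a        // copies the header words, shares the storage
b[0] = 1
print(a[0])      // both "copies" observe the write
```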
A Table is (for our purposes) a resource with identity, and it wouldn't make sense to have it be a value type with copy-on-write, at least not in most cases, so it felt like a natural fit for a class type. Writing it as a struct with reference semantics is kind of unorthodox, I guess, but I'm not sure I can see any real downsides if it would mean that the kind of optimizations we're talking about here would be more likely / predictable? |
I would say yes, it's better to use a struct, but if you ever pass that Table<Data> around, or store it, as an existential (protocol type), then it might result in extra heap allocation. The existential buffer is just three words, and you have at least four words, or many more depending on the number of dimensions. So I would say it's worth a try, but it ultimately depends on how the Table is being used. By the way, we should eventually expose an API for tail-allocated class storage so you don't need the double indirection to get at the table data. |
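For what it's worth, the standard library's `ManagedBuffer<Header, Element>` already provides a form of tail-allocated class storage; a sketch (names mine, assuming the header stores the element count) that keeps the element data in the same allocation as the header, avoiding one level of indirection:

```swift
final class PixelBuffer: ManagedBuffer<Int, Float> {  // header = element count
    static func make(count: Int) -> PixelBuffer {
        let buf = create(minimumCapacity: count) { _ in count }
        buf.withUnsafeMutablePointerToElements {
            $0.initialize(repeating: 0, count: count)
        }
        // create(minimumCapacity:makingHeaderWith:) allocates the subclass,
        // so this downcast is the documented usage pattern.
        return buf as! PixelBuffer
    }
}

let pixels = PixelBuffer.make(count: 8)
let first = pixels.withUnsafeMutablePointerToElements { $0[0] }
print(pixels.header, first)
```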
Great, thanks! |
To be clear, your original intuition is generally correct. Make it a class if it has class-like semantics. I was only speaking to the possibility of optimizing away the `size` property as a compile-time constant. In certain circumstances, the optimizer may have an easier time "seeing through" the struct initialization. |
Yes, understood. |
Attachment: (zip with the demonstration program; see Issue Description)
Environment
Recent development snapshots (at least 2017-10-26 … 29), macOS 10.13.
Additional Detail from JIRA
md5: c127c1be23378239d82633a19201e410
relates to:
Issue Description:
EDIT 1: I have attached an updated version of the code so that it will compile without warnings with snapshot 2017-12-03 (and also added a // NOTE about the need to use @inline(__always) in Table.swift).
EDIT 2: Attached another version of the demonstration program updated to compile with dev snapshot 2018-03-03.
The file below is part of the program in the attachment (the updated one) and explains the issue:
main.swift
So the demonstration is meant to compile without errors and produce images like the one contained in the attached zip file. And the expected result (assuming these really are issues and not expected behaviors) is that the compiler should be able to optimize this program (so that time is 2 rather than 8) with -D KNOWABLE_SIZE but without needing -D MAGIC.
(I first encountered these issues within Xcode and I could've attached an Xcode project or Swift package but I think the risk of accidentally introducing unrelated issues might be reduced by only attaching the source files and a simple build.sh to compile them.)