Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-9223] Inliner exhibits slow compilation time with a large static array #51712

Closed
swift-ci opened this issue Nov 12, 2018 · 32 comments
Closed
Assignees
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself duplicate Resolution: Duplicates another issue expressions Feature: expressions literals Feature → expressions: Literals such as an integer or string literal optimized only Flag: An issue whose reproduction requires optimized compilation performance SILOptimizer Area → compiler: SIL optimization passes swift 5.8

Comments

@swift-ci
Copy link
Collaborator

swift-ci commented Nov 12, 2018

Previous ID SR-9223
Radar None
Original Reporter andreasw (JIRA User)
Type Bug
Status Reopened
Resolution

Attachment: Download

Environment

Makefile

Additional Detail from JIRA
Votes 0
Component/s
Labels Bug
Assignee @atrick
Priority Medium

md5: c44b8d1ce0417361133fc33640032956

Issue Description:

Hi!

Please see the attached Makefile and perf report. This time it's the inliner (and verifier).

With 5.000 elements compiling takes less than 3 seconds, with 10.000 more than 15, with 15.000 37 seconds.

It seems that commit bd28b0e (SILCloner and SILInliner rewrite) is the culprit.

+ 71,47% 0,06% swift swift [.] runOnFunctionRecursively ◆
+ 71,29% 0,01% swift swift [.] swift::SILInliner::inlineFunction ▒
+ 71,22% 0,02% swift swift [.] swift::SILInlineCloner::cloneInline ▒

-   70,39% 70,19% swift swift [.] llvm::ilist_traits<swift::SILInstruction>::transferNodesFromList ▒
    70,19% runOnFunctionRecursively ▒
    swift::SILInliner::inlineFunction ▒
    + swift::SILInlineCloner::cloneInline ▒
    + 35,74% 0,01% swift swift [.] swift::mergeBasicBlockWithSuccessor ▒
    + 35,69% 0,01% swift swift [.] swift::mergeBasicBlockWithSingleSuccessor ▒
    + 35,63% 0,03% swift swift [.] swift::SILBasicBlock::spliceAtEnd ▒
    + 34,80% 0,04% swift swift [.] swift::SILBasicBlock::split ▒
    + 20,74% 0,04% swift swift [.] swift::SILFunction::verify ▒
    + 20,38% 0,26% swift swift [.] (anonymous namespace)::SILVerifier::visitSILBasicBlock ▒
    + 20,07% 0,13% swift swift [.] swift::SILInstructionVisitor<(anonymous namespace)::SILVerifier, void>::visit ▒
    + 19,46% 18,37% swift swift [.] (anonymous namespace)::SILVerifier::visitSILInstruction ▒
    + 17,22% 0,00% swift swift [.] swift::SILModule::verify

The functions inlined are from the builtin UInt and Array classes.

@atrick
Copy link
Member

atrick commented Nov 13, 2018

Please send me a test case.

@swift-ci
Copy link
Collaborator Author

swift-ci commented Nov 13, 2018

Comment by Andreas Wendleder (JIRA)

The Makefile is the test case. It basically generates a swift file that looks like this:

let x: [UInt] = [1, 2, 3, 4, 5, ..., 4999, 5000] and compiles it.

@swift-ci
Copy link
Collaborator Author

swift-ci commented Nov 15, 2018

Comment by Andreas Wendleder (JIRA)

Here are some timings and a diagram that shows quadratic behaviour:

https://docs.google.com/spreadsheets/d/16VakcOYvcwZbmPRLxO_1HcLUwCx0xEVT5mXWYsYnQjY/edit?usp=sharing

which I think comes from SILInstruction.cpp +88:

// Update the parent fields in the instructions.  
for (; first != last; ++first)  
first->ParentBB = ThisParent;

I have a static array with >50.000 elements which takes half an hour to compile. With C++ it takes 0.3 seconds.

@atrick
Copy link
Member

atrick commented Nov 15, 2018

That's right. Merging two basic blocks is actually O[n] because of the parent pointer update that happens. I changed the SILCloner API to simplify cloning block terminators, then worked around it in the inliner by merging the block, but that was a mistake. Now I'm adding back the ability of the SILCloner to insert cloned instructions in the middle of a block.

@atrick
Copy link
Member

atrick commented Nov 16, 2018

PR: #20630

@swift-ci
Copy link
Collaborator Author

swift-ci commented Nov 16, 2018

Comment by Andreas Wendleder (JIRA)

Better but still quadratic, see the updated diagram at http://bit.ly/2Dqiwx1.

This time the culprit is SILBasicBlock::split which is called n times and is itself O[n].

Here is the stack trace:

+ 87,15% 0,09% swift swift [.] runOnFunctionRecursively ◆
+ 87,01% 0,01% swift swift [.] swift::SILInliner::inlineFunction ▒
+ 86,94% 0,02% swift swift [.] swift::SILInlineCloner::cloneInline ▒
+ 85,82% 85,44% swift swift [.] llvm::ilist_traits<swift::SILInstruction>::transferNodesFrom▒
+ 85,81% 0,05% swift swift [.] swift::SILBasicBlock::split

I think instead of repeatedly splitting the existing SILBasicBlock it would be better to generate a new one and copy/clone everything into that:

MandatoryInlining.cpp:535:

for (auto BI = F->begin(), BE = F->end(); BI != BE; ++BI) {
    // ~~While iterating over this block, instructions are inserted and deleted.~~

    // Every instruction is copied and every function cloned into this basic block:

    auto BB = F->createBasicBlock();
    for (auto II = BI->begin(), nextI = II; II != BI->end(); II = nextI) {

        if (Callee)

            inlineFunctionInto(Callee, BB);

        else

            copyInstructionInto(BB);

    replace(BI, BB);

    delete(BB);

It uses a little more memory but is O[n]

@atrick
Copy link
Member

atrick commented Nov 16, 2018

That's right. I deviated from the original goal last night and explored making the CFG more canonical in general. I think that will be a good thing, but back to your problem...

I see two solutions:

(1) Reintroduce SILInliner's special case for single-block callee functions to avoid splitting. This simply skips the `split` operation when it detects a single block callee. This solves the problem of static array initialization. But the complexity bleeds into the SILCloner and SILInliner which need to handle this special case all over the place. More importantly, it doesn't solve the problem that the SILInliner is quadratic in general (and always has been).

I think what you're showing above is an alternate implementation of this special case where we emit into a fake block, then move the cloned instructions out of the block after inlining.

(2) Run the SILInliner in reverse.

See PR: #20630.

This simplifies everything quite a bit.

I just got this building, haven't tested or profiled it yet, and still need to figure out how to create a compile time test case for the quadratic behavior.

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

Cool.

This solves my problem with the test case, see the above updated diagram (except for a SEGV with more than 70.000 elements).

I'm still compiling my "real" program and am thinking of the best solution.

Thanks,

Andreas

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

I can now also confirm that this fixes my application for non-optimising builds.

Compilation times went from 9 minutes to 13 seconds.

So for me this issue is solved, except for two things:

  1. The SEGV for large matrices.

  2. I still think copying/inlining into a new basic block would be the simpler solution, but that's up to you.

Thanks anyway for your time and dedication,

Andreas

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

Two additional thoughts:

  1. A validation test a la 3e25467 would be nice to trigger quadratic behaviour in the future.

  2. A validation test a la https://github.com/apple/swift/blob/master/validation-test/Sema/type_checker_perf/fast/rdar18522024.swift would be even nicer to trigger compilation times above a second for, say, large arrays with more than 10.000 elements, because this is the fourth or fifth bug I have to deal with with Swift concerning large arrays, involving the constraint system, the inliner, redundant elimination et al.

@atrick
Copy link
Member

atrick commented Nov 17, 2018

> The SEGV for large matrices.

Does this refer to another bug? The RLE bug is the only one I saw.

> I still think copying/inlining into a new basic block would be the simpler solution, but that's up to you.

The problem isn't the inlined instructions, it's the instructions after the call in the same block. In general, they all need to be moved to a new block which means adjusting all their parent pointers. Those moved instructions may themselves be calls that need to be inlined, so we end up moving each original instruction number-of-calls times. Inlining in reverse means we never need to revisit an instruction that was already moved.

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

> Does this refer to another bug? The RLE bug is the only one I saw.

Nope, I just applied your patch and tried several element sizes. Up to 70.000 everything was fine, after that I got a SEGV. No optimisation, no RLE at all.

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

> The problem isn't the inlined instructions, it's the instructions after the call in the same block. In general, they all need to be moved to a new block which means adjusting all their parent pointers. Those moved instructions may themselves be calls that need to be inlined, so we end up moving each original instruction number-of-calls times. Inlining in reverse means we never need to revisit an instruction that was already moved.

Reversing is fine in my case. I still think allocating a new basic block and copying all instructions there and inlining all functions there might simplify the overall logic. But that's just a thought. I am a) not a compiler engineer and b) not too familiar with the Swift code base, so YMMV. 🙂

@atrick
Copy link
Member

atrick commented Nov 17, 2018

graydon (JIRA User) I'd like to add a scale test for the mandatory inlining of a giant array initialization. Quadratic behavior arises when the inliner splits basic blocks. Splitting a block visits all subsequent instructions and adjusts their parent block pointer. See
`llvm::ilist_traits<SILInstruction>::transferNodesFromList` SILInstruction.cpp.
I could add a stat inside the loop:

  // Update the parent fields in the instructions.
  for (; first != last; ++first)
    first->ParentBB = ThisParent;

That seems fairly intrusive.

I could add a separate loop in the inliner to count the number of instructions after each split point. That seems wasteful. Can you think of a better way to verify that mandatory inlining isn't quadratic?

@swift-ci
Copy link
Collaborator Author

swift-ci commented Nov 17, 2018

Comment by Andreas Wendleder (JIRA)

That is the stack at the SEGV:

(lldb) bt

-   thread #1, name = 'swift', stop reason = signal SIGSEGV: invalid address (fault address: 0xffffffffffffff22)

-   frame #0: 0x0000000000d0c321 swift`(anonymous namespace)::ArgEmitter::emit(swift::Lowering::ArgumentSource&&, swift::Lowering::AbstractionPattern) + 2017
    frame #1: 0x0000000000d0d1b0 swift`(anonymous namespace)::ArgEmitter::emitExpanded(swift::Lowering::ArgumentSource&&, swift::Lowering::AbstractionPattern) + 816
    frame #2: 0x0000000000d0bce0 swift`(anonymous namespace)::ArgEmitter::emit(swift::Lowering::ArgumentSource&&, swift::Lowering::AbstractionPattern) + 416
    frame #3: 0x0000000000d075a0 swift`(anonymous namespace)::ArgEmitter::emitTopLevel(swift::Lowering::ArgumentSource&&, swift::Lowering::AbstractionPattern) + 4656
    frame #4: 0x0000000000d156c6 swift`(anonymous namespace)::CallSite::emit(swift::Lowering::SILGenFunction&, swift::Lowering::AbstractionPattern, swift::CanTypeWrapper<swift::SILFunctionType>, (anonymous namespace)::ParamLowering&, llvm::SmallVectorImpl<swift::Lowering::ManagedValue>&, llvm::SmallVectorImpl<(anonymous namespace)::DelayedArgument>&, llvm::Optional<swift::ForeignErrorConvention> const&, swift::ImportAsMemberStatus) && + 710
    frame #5: 0x0000000000d15130 swift`(anonymous namespace)::CallEmission::emitArgumentsForNormalApply(swift::CanTypeWrapper<swift::FunctionType>&, swift::Lowering::AbstractionPattern&, swift::CanTypeWrapper<swift::SILFunctionType>, llvm::Optional<swift::ForeignErrorConvention> const&, swift::ImportAsMemberStatus, llvm::SmallVectorImpl<swift::Lowering::ManagedValue>&, llvm::Optional<swift::SILLocation>&, swift::CanTypeWrapper<swift::FunctionType>&) + 1200
    frame #6: 0x0000000000d00a42 swift`(anonymous namespace)::CallEmission::apply(swift::Lowering::SGFContext) + 3202
    frame #7: 0x0000000000cffd1a swift`swift::Lowering::SILGenFunction::emitApplyExpr(swift::Expr*, swift::Lowering::SGFContext) + 2218
    frame #8: 0x0000000000ca8a83 swift`swift::ASTVisitor<(anonymous namespace)::RValueEmitter, swift::Lowering::RValue, void, void, void, void, void, swift::Lowering::SGFContext>::visit(swift::Expr*, swift::Lowering::SGFContext) + 83
    frame #9: 0x0000000000caae08 swift`swift::ASTVisitor<(anonymous namespace)::RValueEmitter, swift::Lowering::RValue, void, void, void, void, void, swift::Lowering::SGFContext>::visit(swift::Expr*, swift::Lowering::SGFContext) + 9176
    frame #10: 0x0000000000c9eba1 swift`swift::Lowering::SILGenFunction::emitExprInto(swift::Expr*, swift::Lowering::Initialization*, llvm::Optional<swift::SILLocation>) + 289
    frame #11: 0x0000000000c93288 swift`swift::Lowering::SILGenFunction::emitPatternBinding(swift::PatternBindingDecl*, unsigned int) + 280
    frame #12: 0x0000000000c9333d swift`swift::Lowering::SILGenFunction::visitPatternBindingDecl(swift::PatternBindingDecl*) + 45
    frame #13: 0x0000000000c724af swift`swift::Lowering::SILGenModule::visitTopLevelCodeDecl(swift::TopLevelCodeDecl*) + 255
    frame #14: 0x0000000000c72b3b swift`swift::Lowering::SILGenModule::emitSourceFile(swift::SourceFile*) + 811
    frame #15: 0x0000000000c736d1 swift`swift::SILModule::constructSIL(swift::ModuleDecl*, swift::SILOptions&, swift::FileUnit*) + 273
    frame #16: 0x0000000000c73b77 swift`swift::performSILGeneration(swift::FileUnit&, swift::SILOptions&) + 23
    frame #17: 0x00000000004ca182 swift`performCompile(swift::CompilerInstance&, swift::CompilerInvocation&, llvm::ArrayRef<char const*>, int&, swift::FrontendObserver*, swift::UnifiedStatsReporter*) + 7410
    frame #18: 0x00000000004c77be swift`swift::performFrontend(llvm::ArrayRef<char const*>, char const*, void*, swift::FrontendObserver*) + 3454
    frame #19: 0x000000000047f01e swift`main + 670
    frame #20: 0x00007ffff73f009b libc.so.6`__libc_start_main(main=(swift`main), argc=13, argv=0x00007fffffffdd88, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffdd78) at libc-start.c:308:16
    frame #21: 0x000000000047d5aa swift`_start + 42

@atrick
Copy link
Member

atrick commented Nov 17, 2018

This is the SILGen assert for the initialization of 70k array elements:

Assertion failed: (params.size() == labels.size()), function relabelParams, file /s/sown/swift/lib/AST/ASTContext.cpp, line 3784.

This should be filed as a separate bug against SILGen. Someone may have thought it was ok to use 16 bits for a param index.

@swift-ci
Copy link
Collaborator Author

swift-ci commented Nov 17, 2018

Comment by Andreas Wendleder (JIRA)

I created #51762 for this.

@swift-ci
Copy link
Collaborator Author

Comment by Graydon Hoare (JIRA)

@atrick I'm not sure what you mean by intrusive; are you concerned with the overhead of bumping a counter during that loop?

@atrick
Copy link
Member

atrick commented Nov 19, 2018

graydon (JIRA User) I'm not too concerned about adding a pointer increment if that's really the best way to do it. By intrusive I mean:

  1. If SIL has any performance-critical loops, this is a candidate (after walking operands).

  2. It seems odd to add a stat in such a low-level SIL operation.

Still, I think it would be worth it to have a test case. I just want to make sure it's the best way to do it.

@swift-ci
Copy link
Collaborator Author

swift-ci commented Apr 4, 2019

Comment by Andreas Wendleder (JIRA)

It's getting worse: Timings with Swift 5:

5.000 elements: 7.5 seconds

10.000 elements: 34 seconds

15.000 elements: 81 seconds

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

Performance numbers with latest snapshot from 2019-04-19 with fixes from @eeckstein:

5.000 elements: 1.4 seconds

10.000 elements: 5.5 seconds

15.000 elements: 12.2 seconds

20.000 elements: 18.9 seconds

Better but still quadratic.

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

Swift 5.0.1 on Linux: 10.000 elements: 16 seconds.

@swift-ci
Copy link
Collaborator Author

Comment by Andreas Wendleder (JIRA)

Swift 5.1 macOS: 10.000 elements: 11 seconds.

@swift-ci
Copy link
Collaborator Author

swift-ci commented Oct 31, 2019

Comment by Andreas Wendleder (JIRA)

Swift 5.1.1 on Linux: 10.000 elements: 3.6s

Perf trace:

+ 85,74% 0,32% swift swift [.] swift::SILFunction::verify
+ 84,71% 0,35% swift swift [.] (anonymous namespace)::SILVerifier::visitSILBasicBlock
+ 84,30% 0,33% swift swift [.] swift::SILInstructionVisitor<(anonymous namespace)::SILVerifier, void>::visit
+ 82,49% 79,61% swift swift [.] (anonymous namespace)::SILVerifier::visitSILInstruction
+ 71,79% 0,00% swift swift [.] swift::SILModule::verify
+ 13,81% 0,00% swift swift [.] swift::Lowering::SILGenModule::emitSourceFile

@swift-ci
Copy link
Collaborator Author

swift-ci commented Jun 8, 2020

Comment by Andreas Wendleder (JIRA)

Swift 5.2.4 Linux:

10.000 elements: 4.4s

15.000 elements: 10.1s

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@gonsolo
Copy link
Contributor

gonsolo commented Oct 4, 2022

Swift 5.7 Linux:

15.000 elements: 0.8s
15.000 elements with "-O": 10.7s

cc @eeckstein @atrick @gottesmm

Updated test case:

SIZE=15000
SHELL:=/bin/bash

#CONFIGURATION=-O

SWIFTC=swiftc

time: bug.swift
	@echo "Size: $(SIZE)"
	time $(SWIFTC) 	$(CONFIGURATION) bug.swift

bug.swift: Makefile

	@echo "let x: [UInt] = [" > $@
	@LANG=C printf '0x%x, ' $$(seq $(SIZE)) >> $@
	@echo "]" >> $@

.PHONY: c clean e edit
c: clean
clean:
	rm -f bug bug.swift
e: edit
edit:
	vi bug.swift
m:
	vi Makefile
p: perf
perf: bug.swift
	time perf record -e cycles -c 10000 --call-graph=lbr $(SWIFTC) $(CONFIGURATION) bug.swift
	perf report

@gonsolo
Copy link
Contributor

gonsolo commented Apr 26, 2023

Swift 5.8 Linux:

15.000 elements: 0.9s
15.000 elements with "-O": 11.3s

@AnthonyLatsis AnthonyLatsis added compiler The Swift compiler in itself SILOptimizer Area → compiler: SIL optimization passes literals Feature → expressions: Literals such as an integer or string literal expressions Feature: expressions optimized only Flag: An issue whose reproduction requires optimized compilation swift 5.8 labels Apr 28, 2023
@AnthonyLatsis
Copy link
Collaborator

Is this still an issue with the optimization passes?

@gonsolo
Copy link
Contributor

gonsolo commented Apr 29, 2023

Depends on whether you find 11.3 seconds acceptable. C++ (as every other language I tried) is <0.5 seconds.
I haven't run perf on Swift 5.8 so I don't know where the time is spent, but the Makefile above should make it easy to spot the culprit.

@AnthonyLatsis
Copy link
Collaborator

No, definitely not acceptable. Was just wondering if you verified if/how the distribution had changed.

@gonsolo
Copy link
Contributor

gonsolo commented May 1, 2023

I ran perf again:

21.11%  swift-frontend   swift-frontend                [.] llvm::DenseMapBase<llvm::DenseMap<swift::AliasAnalysis::AliasCacheKey, swift::AliasAnalysis::AliasResult, llvm::DenseMapInfo<swift::AliasAnalysis::AliasCacheKey, void>, llvm::detail::DenseMapPair<swift::AliasAnalysis::AliasCacheKey, swift::AliasAnalysis::AliasResult> >, swift::AliasAnalysis::AliasCacheKey, swift::AliasAnalysis::AliasResult, llvm::DenseMapInfo<swift::AliasAnalysis::AliasCacheKey, void>, llvm::detail::DenseMapPair<swift::AliasAnalysis::AliasCacheKey, swift::AliasAnalysis::AliasResult> >::LookupBucketFor<swift::AliasAnalysis::AliasCacheKey>
12.60%  swift-frontend   swift-frontend                [.] swift::ProjectionPath::hasNonEmptySymmetricDifference
10.72%  swift-frontend   swift-frontend                [.] swift::AliasAnalysis::alias
 8.17%  swift-frontend   swift-frontend                [.] (anonymous namespace)::DSEContext::invalidateBase
 5.84%  swift-frontend   swift-frontend                [.] (anonymous namespace)::BlockState::processStoreInst

Looks like a hash collision, maybe a better AliasCacheKey? @AnthonyLatsis @eeckstein @atrick

@AnthonyLatsis AnthonyLatsis added the duplicate Resolution: Duplicates another issue label May 3, 2023
@AnthonyLatsis
Copy link
Collaborator

Current stats make this a duplicate of #50243. Closing in favor of the older issue.

@AnthonyLatsis AnthonyLatsis closed this as not planned Won't fix, can't repro, duplicate, stale May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself duplicate Resolution: Duplicates another issue expressions Feature: expressions literals Feature → expressions: Literals such as an integer or string literal optimized only Flag: An issue whose reproduction requires optimized compilation performance SILOptimizer Area → compiler: SIL optimization passes swift 5.8
Projects
None yet
Development

No branches or pull requests

5 participants
@gonsolo @atrick @swift-ci @AnthonyLatsis and others