Skip to content

Single file takes 30 minutes to compile at opt-level=3, regression from <1 sec #66617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dabek opened this issue Nov 22, 2019 · 16 comments
Closed
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-compiletime Issue: Problems and improvements with respect to compile times. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@dabek
Copy link

dabek commented Nov 22, 2019

Put this in your main.rs file: https://gist.github.com/dabek/3474504ab3dae4d264b78504a4d2ee21
Run:

cargo run --release

See it take half an hour (at least it did for me the only time I had the patience to wait for it).

@dtolnay dtolnay added C-bug Category: This is a bug. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Nov 22, 2019
@dtolnay
Copy link
Member

dtolnay commented Nov 22, 2019

Labeling as a regression because 🤣 something is going on:

rustc time cargo build --release
1.0 - 1.18 <1 sec
1.19 7.5 sec
1.20 30 sec
1.21 - 1.23 4 sec
1.24 2 sec
1.25 - 1.41-nightly 30 minutes

@alex
Copy link
Member

alex commented Nov 22, 2019

I got bored after 40 seconds (I am not a patient man) -- this reproduces at opt-level=3, but not opt-level=2. Hopefully that helps narrow it down.

@dtolnay dtolnay changed the title Single 1600 line file with generics taking 30 minutes to compile in release mode Single file takes 30 minutes to compile at opt-level=3, regression from <1 sec Nov 22, 2019
@csmoe csmoe added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Nov 22, 2019
@vivekvpandya
Copy link

On rustc 1.39.0 I see it hangs when running following command :
rustc -C opt-level=3 --emit link main.rs

however
rustc -C opt-level=3 --emit obj main.rs generates .a file quickly , so it seems that linker is not happy with something generate by optimization.

@jsgf
Copy link
Contributor

jsgf commented Nov 22, 2019

-Ztime-passes is showing:

  time: 0.226; rss: 134MB       codegen
    time: 0.002; rss: 134MB     llvm function passes [slow.7rcbfp3g-cgu.4]
    time: 0.001; rss: 134MB     llvm function passes [slow.7rcbfp3g-cgu.2]
    time: 0.004; rss: 134MB     llvm module passes [slow.7rcbfp3g-cgu.2]
    time: 0.014; rss: 136MB     llvm module passes [slow.7rcbfp3g-cgu.4]
    time: 0.039; rss: 136MB     llvm module passes [slow.7rcbfp3g-cgu.3]
    time: 0.050; rss: 136MB     llvm function passes [slow.7rcbfp3g-cgu.0]
    time: 24.948; rss: 267MB    llvm module passes [slow.7rcbfp3g-cgu.0]
>>  time: 4923.626; rss: 297MB  llvm module passes [slow.7rcbfp3g-cgu.1]
    time: 0.004; rss: 267MB     LTO passes
    time: 0.004; rss: 267MB     LTO passes
    time: 0.010; rss: 268MB     codegen passes [slow.7rcbfp3g-cgu.2]
    time: 0.010; rss: 268MB     codegen passes [slow.7rcbfp3g-cgu.4]
    time: 0.056; rss: 269MB     LTO passes
    time: 0.082; rss: 270MB     codegen passes [slow.7rcbfp3g-cgu.3]
    time: 25.502; rss: 270MB    LTO passes
    time: 3.664; rss: 279MB     codegen passes [slow.7rcbfp3g-cgu.0]

It hasn't completed yet. I'm also running it under perf. So far the top hot functions seem to be llvm::isSafeToSpeculativelyExecute and llvm::SmallPtrSetImplBase::FindBucketFor.

facebook-github-bot pushed a commit to facebook/hhvm that referenced this issue Jan 16, 2020
Summary:
Today in "sacrificing readability / safety due to Rust inefficiencies":

I believe some of our compile times are due to (less severe than in example there) version of this bug: rust-lang/rust#66617

Adding box in a key place happens to circumvent it.

Reviewed By: shiqicao

Differential Revision: D18772104

fbshipit-source-id: a94271b208894aecd9bb8ee1f35110b3c0b09961
@andjo403
Copy link
Contributor

andjo403 commented Feb 22, 2020

at nightly (rustc 1.43.0-nightly (8aa9d20 2020-02-21)) around 2sec
the big regression was due to before one huge function was generated for core::ptr::drop_in_place but now multiple smaller functions is generated and that makes llvm mush happier :)
I do not know when this was changed only noticed it now.

@jonas-schievink jonas-schievink added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label Feb 22, 2020
@jonas-schievink
Copy link
Contributor

Marking needstest to add a test to https://github.com/rust-lang-nursery/rustc-perf/

@xttjsn
Copy link

xttjsn commented Mar 13, 2020

Failed to reproduce the fix at 8aa9d20 2020-02-21. Some attempts that I've tried:

~/src/issue66617 (master*) [11:26:27] 0
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -vV
rustc 1.43.0-nightly (8aa9d2014 2020-02-21)
binary: rustc
commit-hash: 8aa9d2014f4e5258f83b907e8431c59a33acdae7
commit-date: 2020-02-21
host: x86_64-apple-darwin
release: 1.43.0-nightly
LLVM version: 9.0
~/src/issue66617 (master*) [11:30:07] 1
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -C opt-level=3 -Ztime-passes ./src/main.rs
time: 0.006     parse_crate
time: 0.000     attributes_injection
time: 0.000     recursion_limit
time: 0.000     plugin_loading
time: 0.000     plugin_registration
time: 0.000     pre_AST_expansion_lint_checks
time: 0.000     crate_injection
time: 0.014     expand_crate
time: 0.000     check_unused_macros
time: 0.014     macro_expand_crate
time: 0.000     maybe_building_test_harness
time: 0.001     AST_validation
time: 0.000     maybe_create_a_macro_crate
time: 0.000     complete_gated_feature_checking
time: 0.017     configure_and_expand
time: 0.000     prepare_outputs
time: 0.002     hir_lowering
time: 0.000     early_lint_checks
time: 0.001     validate_HIR_map
time: 0.000     setup_global_ctxt
time: 0.000     dep_graph_tcx_init
time: 0.003     create_global_ctxt
time: 0.000     looking_for_entry_point
time: 0.000     looking_for_plugin_registrar
time: 0.000     looking_for_derive_registrar
time: 0.001     misc_checking_1
time: 0.006     type_collecting
time: 0.000     impl_wf_inference
time: 0.000     unsafety_checking
time: 0.000     orphan_checking
time: 0.002     coherence_checking
time: 0.007     wf_checking
time: 0.021     item_types_checking
time: 0.012     item_bodies_checking
time: 0.047     type_check_crate
time: 0.000     match_checking
time: 0.001     liveness_and_intrinsic_checking
time: 0.001     misc_checking_2
time: 0.005     MIR_borrow_checking
time: 0.000     dumping_chalk_like_clauses
time: 0.000     MIR_effect_checking
time: 0.000     layout_testing
time: 0.001     death_checking
time: 0.000     unused_lib_feature_checking
time: 0.005     crate_lints
time: 0.002     module_lints
time: 0.007     lint_checking
time: 0.002     privacy_checking_modules
time: 0.011     misc_checking_3
time: 0.000     monomorphization_collector_root_collections
time: 0.031     monomorphization_collector_graph_walk
time: 0.005     partition_and_assert_distinct_symbols
time: 0.000     write_allocator_module
time: 0.000     find_cgu_reuse
time: 0.031     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.4)
time: 0.112     codegen_to_LLVM_IR
time: 0.000     assert_dep_graph
time: 0.000     serialize_dep_graph
time: 0.150     codegen_crate
time: 0.002     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.1)
time: 0.001     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.0)
time: 0.001     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.3)
time: 0.001     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.3)
time: 0.006     free_global_ctxt
time: 0.004     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.0)
time: 0.011     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.1)
time: 0.035     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.2)
time: 10.943    LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.2)

It hasn't completed but I have waited for about 2 mins.

Here's a part of a perf report (I re-ran it on a linux machine):

Samples: 11K of event 'cpu-clock:pppH', Event count (approx.): 2875250000
Overhead  Command  Shared Object                     Symbol
  32.88%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::isSafeToSpeculativelyExecute
  27.12%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::SmallPtrSetImplBase::insert_imp_big
  16.16%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::SmallPtrSetImplBase::Grow
  10.98%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] appendSpeculatableOperands
   4.83%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] completeEphemeralValues
   1.43%  rustc    libc-2.27.so                      [.] __memset_avx2_erms
   1.31%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] canTrapImpl
   1.18%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::CodeMetrics::collectEphemeralValues
   0.57%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] (anonymous namespace)::CallAnalyzer::analyzeCall
...

The newest master seems to have the same issue. I doubt it was fixed.

@spastorino spastorino added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label May 21, 2020
@JohnTitor JohnTitor removed the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label May 27, 2020
@spastorino
Copy link
Member

Assigning P-medium as discussed as part of the Prioritization Working Group process and removing I-prioritize.

@spastorino spastorino added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 27, 2020
@JohnTitor JohnTitor added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label May 27, 2020
@jonas-schievink jonas-schievink removed the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label May 27, 2020
@spastorino
Copy link
Member

spastorino commented May 27, 2020

Just in case, wanted to point out that this is still an issue using rustc 1.45.0-nightly (a74d1862d 2020-05-14).

@dtolnay
Copy link
Member

dtolnay commented Jun 24, 2020

I got some time to make progress on this today. Here's what's up:

Some observations:

  • Does not reproduce with -Z new-llvm-pass-manager. This doesn't mean just wait for the new pass manager to become default and then close this issue, since it could be a problem that gets reintroduced as more stuff is ported to new pass manager. We wouldn't want to revert the workarounds in HHVM only for it to disruptively regress in another LLVM upgrade.
  • Does not reproduce with -C codegen-units=1. Presumably the function experiencing the catastrophic performance is erased by DCE or similar before the problematic pass.

Progress:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 99.8300 seconds (99.9532 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  98.6132 ( 98.9%)   0.0245 ( 34.9%)  98.6378 ( 98.8%)  98.7605 ( 98.8%)  Function Integration/Inlining
   0.1416 (  0.1%)   0.0046 (  6.5%)   0.1462 (  0.1%)   0.1463 (  0.1%)  Combine redundant instructions #2
   0.0660 (  0.1%)   0.0021 (  3.0%)   0.0681 (  0.1%)   0.0682 (  0.1%)  Global Value Numbering
   0.0576 (  0.1%)   0.0020 (  2.9%)   0.0596 (  0.1%)   0.0596 (  0.1%)  Combine redundant instructions #5
   0.0580 (  0.1%)   0.0016 (  2.3%)   0.0596 (  0.1%)   0.0596 (  0.1%)  Combine redundant instructions #3
   0.0577 (  0.1%)   0.0013 (  1.8%)   0.0590 (  0.1%)   0.0589 (  0.1%)  Combine redundant instructions #4
   0.0483 (  0.0%)   0.0011 (  1.6%)   0.0494 (  0.0%)   0.0494 (  0.0%)  Jump Threading
   0.0428 (  0.0%)   0.0013 (  1.8%)   0.0441 (  0.0%)   0.0441 (  0.0%)  Early CSE w/ MemorySSA
   0.0398 (  0.0%)   0.0010 (  1.4%)   0.0408 (  0.0%)   0.0408 (  0.0%)  Memory SSA
...

I will take a basic look in the function integration/inlining pass later this week with profiling tools and then hand over to my company's LLVM team if needed.

@nikic
Copy link
Contributor

nikic commented Jun 24, 2020

Looks like all the time is spent inside CodeMetrics::collectEphemeralValues(). There's also this helpful comment: https://github.com/llvm/llvm-project/blob/01ddb2a7b044f697a15043e47acdb93e2825809a/llvm/lib/Analysis/InlineCost.cpp#L2076-L2080

@dtolnay
Copy link
Member

dtolnay commented Jun 25, 2020

The behavior looks exactly cubic in the number of enum variants as measured by time rustc main.rs -C opt-level=3 -C codegen-units=2 on the reproducer in #66617 (comment). Compile time for x variants is accurately predicted by some constant times x^3.


slow

@dtolnay
Copy link
Member

dtolnay commented Jul 15, 2020

Update:

The cubic behavior is a legacy pass manager limitation which is a consequence of not having call site's BFI (block frequency) in the legacy inliner. This is a known advantage for inlining with the new pass manager (#74705) vs legacyPM.

Here is the place legacy inliner is lacking BFI under legacyPM (note nullptr for GetBFI):
https://github.com/llvm/llvm-project/blob/3c2299612945caf75d5c3678ced0693ebd291819/llvm/lib/Transforms/IPO/InlineSimple.cpp#L71-L73

At this point we are going to investigate transitioning our affected codebase to newPM.

For anyone needing to work around this on legacyPM, one effective legacyPM workaround we found is running opt with -inline-last-call-to-static-bonus=0 (would need to look up what the LLVM C API equivalent of this is), but it comes at the cost of possible code size regressions.

@andjo403
Copy link
Contributor

for this specific case where a lot of vec's is used in the enums I was able to get the times down by adding inline to RawVec::ptr this was one of the two functions that was defined in cgu0 but used in cgu1 and when inline was added it is defined in cgu1 also.

I do not understand why the llvm inline pass time is affected by if the function is defined or declared in one cgu but will try to look in to it some more tomorrow.

Is a PR with with the inline hint something to do? even as it do not solve the complete problem a lot of functions is marked with inline in RawVec but not this one liner function.

bors added a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2021
mark raw_vec::ptr with inline

when a lot of vectors is used in a enum as in the example in rust-lang#66617 if this function is not inlined and multiple cgus is used this results in huge compile times. with this fix the compile time is 6s from minutes for the example in rust-lang#66617. I did not have the patience to wait for it to compile for more then 3 min.
@ljedrz
Copy link
Contributor

ljedrz commented Jul 6, 2022

Can't reproduce in current stable (1.62.0).

@workingjubilee
Copy link
Member

Thank you, @ljedrz and everyone else who has gone to the effort of reproducing or investigating this bug. Considering the current status and dtolnay tracing it directly to the pass manager,it looks like we can close this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-compiletime Issue: Problems and improvements with respect to compile times. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance
Projects
None yet
Development

No branches or pull requests