[SR-4669] Add a Benchmark_Driver --rerun N option #47246
Comments
Andrew, I think I could work on that, but I'm still a bit fuzzy on the motivation here. It seems like a workaround for the fact that the currently reported improvements and regressions are unstable, because they are not computed from the Mean and don't use the standard deviation to determine whether a change is truly significant. See SR-4597; I'm currently working on that.
I don't recommend anyone work on a feature unless they see the value in it for their own work. Here's my motivation for the feature.

Results will be unstable or misleading whenever you have an insufficient timing interval or number of data points, regardless of how you analyze the data. I'm not interested in either the mean or the standard deviation because I find microbenchmark kernels to be modal depending on the system state. I also find that rerunning the benchmark at a different time, after running a number of other workloads, is more effective at weeding out noise than continuing to run the same loop for a couple more seconds. For microbenchmarks that are not intended to measure cache effects or system I/O, I'm most interested in the minimum time, which is almost always a hard limit once you factor out noise. Remember that microbenchmarks are not user workloads to begin with, so measuring their incidental effects on system behavior isn't particularly interesting.

I only have time to wait for about 3 full runs of the benchmark driver. That's 3 separate runs, not 3 iterations per benchmark. I only have time to investigate large (> 5%) regressions. Just rerunning the small number of tests that may have been affected by system noise is very effective. I do this manually now every time I benchmark.

Benchmarking should always be done on a quiet machine if you value your sanity. If you're using multi-user servers, you need the ability to hard-partition their cores, memory, and file system.

In the past, people have tried to factor out noise in Swift's benchmark suite by increasing the number of iterations in a single invocation of the benchmark driver. I think that's a terrible idea beyond a few iterations. `swift-ci benchmark` currently runs 20 iterations, taking all day to run and making it essentially unusable. That's why I still manually benchmark during development. Running for 20 iterations does accidentally cut down on noise simply by running each benchmark for a ridiculously long time. You could get a far superior result by running for 3 iterations and rerunning only the noisy benchmarks for 10 more iterations as a separate driver invocation. Also, you would have results in less than 20 minutes!

Now, if you care about 1-2% performance changes on microbenchmarks, then you had better make sure all your path names and environment variables are exactly the same length, pin the process to a core, and then spend days gathering hundreds of data points. Once you've done that, I applaud your statistical rigor. Unfortunately, I don't think our current CI setup is good for that. I also don't think our microbenchmarks are rigorously designed to factor out startup time.

One more thing to keep in mind if you want to chase down every small performance change: even reproducible changes in benchmark performance are usually "noise" in the sense that they are incidental side effects of imperfect compiler heuristics and unmodeled system behavior. An irrefutably good compiler optimization usually regresses some benchmarks as a result of bad luck in some downstream compiler pass. Getting to the bottom of these regressions is always a time-consuming process.
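To make the minimum-versus-mean point concrete, here is a small, made-up illustration (the sample values are invented, not real benchmark data):

```python
# One noisy sample skews the mean but leaves the minimum (the "hard limit")
# untouched. Numbers below are invented for illustration only.

quiet = [1002, 998, 1001]
noisy = [1002, 998, 1430]   # one sample hit by unrelated system activity

mean = lambda xs: sum(xs) / len(xs)
print(mean(quiet), mean(noisy))  # ~1000.3 vs ~1143.3 -- the mean shifts by ~14%
print(min(quiet), min(noisy))    # 998 vs 998 -- the minimum is unaffected
```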
That makes sense. I think I have arrived at a similar heuristic, but I think the Mean and SD can help us here. Have a look at the attached Numbers file I used to analyze benchmark runs while making my case against averaging MAX_RSS in #8793. The Release-3s tab contains a run with 3 samples, Release-20s one with 20 samples. I'm working on a Late 2008 MBP, so I feel your pain about the length of benchmark runs. I also usually run with the 3-sample default, which I cribbed from … After fixing … If I understand you correctly, you suggest to re-run … Or do you want to rerun only the changes that are over 5%? You seem to be a fan of long-form writing; could you read through and chime in on what I argued in #8793? You mentioned above:
My understanding is this: …
BTW, when working on a particular subset of code, for example benchmarking PrefixSequence, I felt the need to run a family of tests without having to specify them all one by one, so I'm adding support for …
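For illustration, selecting such a family of tests by name prefix could work roughly like the following sketch; the actual option name is not shown above, so the function and the test names here are hypothetical:

```python
# Sketch of picking a "family" of tests by name prefix instead of listing
# them one by one. Test names below are placeholders.

def select_family(all_tests, prefix):
    """Return every test whose name starts with the given prefix."""
    return [name for name in all_tests if name.startswith(prefix)]

tests = ["PrefixSequence", "PrefixSequenceLazy", "SuffixSequence"]
print(select_family(tests, "Prefix"))  # ['PrefixSequence', 'PrefixSequenceLazy']
```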
My motivation for this --rerun option relates to measuring CPU time of …

In either case, MEAN is not very interesting until you have a very …

I do think it would be potentially useful to rerun certain benchmarks …

Instead I'm proposing that the driver rerun some benchmarks only …
@atrick I was thinking about how you (mostly don't) use the Benchmark_Driver. I'll definitely work on this.
I actually have a "repeat" script. I give it the driver command line and tell it how many times to run the command. It makes perfect sense to have an option to do that.

Ideally I would like to see two independent options: --num-samples N (--iterations is a misnomer; I'm not sure how we ended up with that name. Right @gottesmm?) …

However, I still want to be able to manually rerun benchmarks and concatenate the output. Sometimes you decide to gather more data later without knowing ahead of time how many samples/repeats you need. I don't want to throw away the old data, though.
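A repeat wrapper along those lines could look roughly like this sketch (the argument handling, flags, and paths are assumptions, not the actual script):

```python
# Rough sketch of a "repeat" wrapper: run the given driver command N times
# and concatenate the output of every run on stdout.
import subprocess
import sys

def repeat(times, command):
    """Run `command` `times` times and return the concatenated stdout."""
    output = []
    for _ in range(times):
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        output.append(result.stdout)
    return "".join(output)

if __name__ == "__main__":
    n = int(sys.argv[1])   # e.g. 3
    cmd = sys.argv[2:]     # e.g. ./Benchmark_Driver run ... (placeholder)
    sys.stdout.write(repeat(n, cmd))
```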
@atrick You are correct.
Regarding the ability to handle multiple entries in the output per key...
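If that refers to concatenated driver output containing several result rows per benchmark name, one way to merge them might look like this sketch (the row format and the min reduction are assumptions):

```python
# Group samples by benchmark name and reduce each group to its minimum,
# so appended reruns just add more samples for the same key.
from collections import defaultdict

def merge_results(rows):
    """rows: iterable of (benchmark_name, sample_time) pairs."""
    by_name = defaultdict(list)
    for name, time in rows:
        by_name[name].append(time)
    return {name: min(samples) for name, samples in by_name.items()}

rows = [("ArrayAppend", 1023), ("ArrayAppend", 998), ("DictionaryLookup", 455)]
print(merge_results(rows))  # {'ArrayAppend': 998, 'DictionaryLookup': 455}
```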
Closing, as this was superseded by @eeckstein's work on …
Additional Detail from JIRA
md5: 904f74a88b2c2fd58519e6a6f68673ba
is duplicated by:
Issue Description:
This feature would work as follows:
1. Run all the benchmarks as usual, according to all the other options, just like 'Benchmark_Driver run'.
2. Run the compare script, just like 'Benchmark_Driver compare'.
3. From the output of the comparison, scrape the list of tests in the significant regressions and improvements.
4. Rerun just that subset for N iterations. The user would normally want N to be much higher than the initial iteration count, say 20 vs. 3. The output of each rerun should simply be appended to the previous output; that's how the compare_script was originally designed to work.
The driver has almost all of the functionality to do this already. The only thing missing is parsing the compare_script's output.
An alternative would be to make the compare_script a Python module that the driver can import.
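For illustration, the driver-side logic for steps 3 and 4 could look something like this sketch, using plain dictionaries of per-test times in place of the real driver and compare_script interfaces (all names, thresholds, and the output format are assumptions):

```python
# Sketch of picking the subset to rerun: compare old vs. new times, keep only
# the tests that changed by more than 5%, and schedule them for many more
# iterations. Data and names are placeholders.

def significant_changes(old, new, threshold=0.05):
    """Names of tests whose time moved by more than `threshold` either way."""
    changed = []
    for name in old.keys() & new.keys():
        delta = (new[name] - old[name]) / old[name]
        if abs(delta) > threshold:
            changed.append(name)
    return changed

def rerun_plan(old, new, rerun_iters=20):
    """Steps 3 and 4: the subset to rerun and the iteration count for each."""
    return {name: rerun_iters for name in significant_changes(old, new)}

old = {"ArrayAppend": 998, "DictionaryLookup": 455, "SortStrings": 2100}
new = {"ArrayAppend": 1098, "DictionaryLookup": 452, "SortStrings": 2105}
print(rerun_plan(old, new))  # {'ArrayAppend': 20} -- only the >5% change is rerun
```

In an actual implementation, `old` and `new` would come from the compare_script's output, either by parsing it or by importing the script as a module, per the alternative above.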