[SR-4669] Add a Benchmark_Driver --rerun N option #47246

Closed
atrick opened this issue Apr 22, 2017 · 9 comments
Labels: benchmarks, feature (A feature request or implementation), good first issue (Good for newcomers)

Comments

@atrick (Member) commented Apr 22, 2017

Previous ID: SR-4669
Radar: rdar://problem/32029925
Original Reporter: @atrick
Type: New Feature
Status: Closed
Resolution: Done

Attachment: Download

Additional Detail from JIRA
Votes: 1
Component/s: Source Tooling
Labels: New Feature, StarterBug
Assignee: @palimondo
Priority: Medium

md5: 904f74a88b2c2fd58519e6a6f68673ba

is duplicated by:

  • SR-4814 Smoke benchmark optimization

Issue Description:

This feature would work as follows:

1. Run all the benchmarks as usual, according to all the other options, just like 'Benchmark_Driver run'.

2. Run the compare script, just like Benchmark_Driver compare.

3. From the output of the comparison, scrape the list of tests with significant regressions and improvements.

4. Rerun just that subset for N iterations. The user would normally want N to be much higher than the initial iteration count, say 20 vs. 3. The output of each rerun should simply be appended to the previous output; that's how the compare_script was originally designed to work.

The driver has almost all of the functionality to do this already. The only thing missing is parsing the compare_script's output.

An alternative would be to make the compare_script a python module that the driver can import.
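
A minimal sketch of the proposed flow, following steps 1-4 above. The subcommand names, the `compare_script` invocation, and the report format being scraped are illustrative assumptions for this sketch, not the actual Benchmark_Driver or compare-script interfaces:

```python
#!/usr/bin/env python3
"""Illustrative sketch of the proposed --rerun flow (hypothetical interfaces)."""
import re
import subprocess


def run_driver(log_path, num_samples, tests=None):
    """Run the benchmarks and append their output to log_path (steps 1 and 4)."""
    cmd = ['Benchmark_Driver', 'run', '--num-samples', str(num_samples)]
    if tests:
        cmd += tests
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    with open(log_path, 'a') as f:   # append, so reruns extend the previous output
        f.write(out)


def changed_tests(old_log, new_log):
    """Run the comparison (step 2) and scrape test names from its
    regression/improvement sections (step 3). The report format assumed
    here (section headings followed by one test per line) is made up."""
    report = subprocess.run(['compare_script', old_log, new_log],
                            capture_output=True, text=True, check=True).stdout
    names, in_changes = [], False
    for line in report.splitlines():
        if re.match(r'^(Regression|Improvement)', line):
            in_changes = True
        elif re.match(r'^(No Changes|Added|Removed)', line):
            in_changes = False
        elif in_changes and line.strip():
            names.append(line.split()[0])
    return names


def run_with_rerun(old_log, new_log, initial_samples=3, rerun_samples=20):
    run_driver(new_log, initial_samples)              # 1. run everything
    suspects = changed_tests(old_log, new_log)        # 2.-3. compare and scrape
    if suspects:
        run_driver(new_log, rerun_samples, suspects)  # 4. rerun only the changes
```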

@palimondo (Mannequin) commented Apr 22, 2017

Andrew, I think I could work on that, but I'm still a bit fuzzy on the motivation here. It seems like a workaround for the fact that the currently reported improvements and regressions are unstable, because they are not computed from the Mean and don't use the standard deviation to determine whether the change is truly significant. See SR-4597; I'm currently working on that.

@atrick (Member, Author) commented Apr 22, 2017

I don't recommend anyone work on a feature unless they see the value in it for their own work. Here's my motivation for the feature.

Results will be unstable or misleading whenever you have an insufficient timing interval or number of data points, regardless of how you analyze the data. I'm not interested in either the mean or standard deviation because I find microbenchmark kernels to be modal depending on the system state. I also find that rerunning the benchmark at a different time, after running a number of other workloads, is more effective at weeding out noise than continuing to run the same loop for a couple more seconds. For microbenchmarks that are not intended to measure cache effects or system I/O, I'm most interested in the minimum time, which is almost always a hard limit once you factor out noise. Remember that microbenchmarks are not user workloads to begin with, so measuring their incidental effects on system behavior isn't particularly interesting.

I only have time to wait for about 3 full runs of the benchmark driver. That's 3 separate runs, not 3 iterations per benchmark. I only have time to investigate large (> 5%) regressions. Just rerunning the small number of tests that may have been affected by system noise is very effective. I do this manually now every time I benchmark. Benchmarking should always be done on a quiet machine if you value your sanity. If you're using multi-user servers, you need the ability to hard-partition their cores, memory, and file system.

In the past, people have tried to factor out noise in Swift's benchmark suite by increasing the number of iterations in a single invocation of the benchmark driver. I think that's a terrible idea beyond a few iterations. `swift-ci benchmark` currently runs 20 iterations, taking all day to run and making it essentially unusable. That's why I still manually benchmark during development. Running for 20 iterations does accidentally cut down on noise simply by running each benchmark for a ridiculously long time. You could get a far superior result by running for 3 iterations and rerunning only the noisy benchmarks for 10 more iterations as a separate driver invocation. Also, you would have results in less than 20 minutes!

Now, if you care about 1-2% performance changes on microbenchmarks, then you better make sure all your path names and environment variables are exactly the same length, pin the process to a core, then spend days gathering hundreds of data points. Once you've done that, then I applaud your statistical rigor. Unfortunately, I don't think our current CI setup is good for that. I also don't think our microbenchmarks are rigorously designed to factor out startup time.

One more thing to keep in mind if you want to chase down every small performance change. Even reproducible changes in benchmark performance are usually "noise" in the sense that they are incidental side-effects of imperfect compiler heuristics and unmodeled system behavior. An irrefutably good compiler optimization usually regresses some benchmarks as a result of bad luck in some downstream compiler pass. Getting to the bottom of these regressions is always a time consuming process.

@palimondo (Mannequin) commented Apr 22, 2017

That makes sense. I think I have arrived at a similar heuristic, but I think Mean and SD can help us here. Have a look at the attached Numbers file I used to analyze benchmark runs while making my case against averaging MAX_RSS in #8793. The Release-3s tab contains a run with 3 samples, Release-20s one with 20 samples. I'm working on a Late 2008 MBP, so I feel your pain about the length of benchmark runs. I also usually run with the default of 3 samples, which I cribbed from build-script -B.

After fixing Benchmark_Driver to correctly report MEAN and SD from Benchmark_O, I used it to compute the coefficient of variation (CV), defined as the ratio of the standard deviation to the mean. This would be another useful heuristic for picking a subset of tests to re-run with a higher number of samples. As you can see on the tab for the run with 20 samples, this tends to lower the CV. But this could also be a side effect of me not doing anything on my machine while that painfully long slog ran overnight...
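
A minimal sketch of that CV heuristic, assuming made-up sample data; the 5% threshold is an arbitrary choice for illustration, and only the Python standard library is used:

```python
"""Illustrative CV heuristic on made-up samples."""
import statistics


def coefficient_of_variation(samples):
    # CV = standard deviation / mean
    return statistics.stdev(samples) / statistics.mean(samples)


samples_by_test = {                   # hypothetical runtimes in microseconds
    'ArrayAppend':   [1203, 1198, 1210],
    'DictionaryGet': [455, 731, 462],   # one sample spoiled by background load
}

NOISY_CV = 0.05                       # 5% threshold, chosen arbitrarily
noisy = [name for name, samples in samples_by_test.items()
         if coefficient_of_variation(samples) > NOISY_CV]
print(noisy)                          # ['DictionaryGet']
```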

If I understand you correctly, you suggest re-running Benchmark_Driver after the comparison, with the tests that have changed (regressed or improved), using a higher number of samples specified in the --rerun N argument. This gives you another independent sample of them, so if the first batch was skewed by some background process running at that time, the rerun has a higher chance of getting an unspoiled sample.

Or do you want to rerun only changes that are over 5%?

You seem to be a fan of long-form writing; could you read through and chime in on what I argued in #8793? You mentioned above:

I'm most interested in the minimum time, which is almost always a hard limit once you factor out noise.

My understanding is this:

When measuring runtime performance on a multitasking system, we have to take into account the system's weather: other processes competing for the CPU. We take multiple samples and try to statistically approximate a "true" runtime. Instead of measuring how our benchmark would perform if it were the only thing running (which is never the case), we settle for a consistently reproducible result. We should discard extreme values and work with the mean.
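
For illustration only, a tiny comparison of the two estimators discussed in this thread, the minimum versus a mean taken after discarding extremes, on invented samples:

```python
"""Minimum vs. trimmed mean on invented samples."""
import statistics

samples = [102, 101, 103, 158, 102]        # one sample hit by background load

minimum = min(samples)                     # 101: the 'hard limit' estimator
trimmed = sorted(samples)[1:-1]            # drop the lowest and highest value
trimmed_mean = statistics.mean(trimmed)    # mean after discarding extremes
print(minimum, round(trimmed_mean, 2))     # 101 102.33
```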

BTW, when working on a particular subset of code, for example benchmarking PrefixSequence, I felt the need to run a family of tests without having to specify them all one by one, so I'm adding support for --filter PREFIX, see SR-4598. You might find that useful, too.

@atrick (Member, Author) commented Apr 23, 2017

My motivation for this --rerun option relates to measuring CPU time of
microbenchmarks. Measuring RSS is a very different thing. As you
explained in #8793, RSS could be quite different as you increase the
number of samples in a process, and that could actually be a bug in
the runtime implementation. Someone will need to use platform tools to
inspect the resident memory. But that's a very specific issue which
we're not concerned with when it comes to CPU time.

In either case, MEAN is not very interesting until you have a very
high number of samples.

I do think it would be potentially useful to rerun certain benchmarks
until SD diminishes. But that's not what I'm proposing. One of my
issues with relying on SD is that I sometimes get a sequence of
identical samples in a row (via --num-samples). However, rerunning at
a different time in a separate process gives different results.

Instead I'm proposing that the driver rerun some benchmarks only
after comparing two different inputs. In the case of CI, it's always
comparing ToT with the user's PR. That way, you can start with a very
small sample size, say three. Even if all three data points are nearly
identical, the benchmark will be rerun whenever the results deviate
from the other input. This is the process that I currently perform
manually before spending time investigating any potential regression,
so I know it weeds out a lot of spurious results. Automating that step
would itself save a lot of time.

@palimondo (Mannequin) commented Apr 25, 2017

@atrick I was thinking about how you (mostly don't) use the Benchmark_Driver. I'll definitely work on this --rerun option. But it got me thinking: maybe I could make proper use of the averaging that is done in BD and implement a --cycle option that would interpret num-samples as the number of runs of the whole test suite, while taking 1 sample of each test per cycle. So instead of the samples from one test being grouped at the same time, making them all susceptible to the same environmental interference, we could gather the same number of samples but in a different grouping. That is basically what you are doing manually, right?
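
A minimal sketch of what such a --cycle mode could look like; `run_single_sample` is a hypothetical stand-in for invoking the benchmark binary once, not an existing Benchmark_Driver API:

```python
"""Illustrative --cycle behaviour: N passes over the suite, 1 sample per test per pass."""
import random                      # only used by the fake measurement below


def run_single_sample(test_name):
    # Hypothetical stand-in for invoking the benchmark binary once.
    return random.gauss(100, 5)


def cycle(tests, num_cycles):
    samples = {t: [] for t in tests}
    for _ in range(num_cycles):    # one pass over the whole suite per cycle
        for t in tests:
            samples[t].append(run_single_sample(t))
    return samples


results = cycle(['ArrayAppend', 'DictionaryGet'], num_cycles=3)
```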

@atrick (Member, Author) commented Apr 26, 2017

I actually have a "repeat" script. I give it the driver command line and tell it how many times to run the command.

It makes perfect sense to have an option to do that. Ideally I would like to see two independent options:

--num-samples N
--repeat N

(--iterations is a misnomer. I'm not sure how we ended up with that name. Right @gottesmm?)

However, I still want to be able to manually rerun benchmarks and concatenate the output. Sometimes you decide to gather more data later without knowing ahead of time how many samples/repeats you need. I don't want to throw away the old data though.
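
A minimal sketch of such a repeat wrapper, assuming a hypothetical `repeat.py` that takes a count, a log path, and an arbitrary driver command line; this is not the actual script mentioned above:

```python
#!/usr/bin/env python3
"""Hypothetical 'repeat' wrapper: run a driver command N times and append all
output to one log, so later reruns can keep concatenating onto existing data.

Usage sketch: repeat.py 3 results.log Benchmark_Driver run --num-samples 3
"""
import subprocess
import sys


def repeat(count, log_path, command):
    for _ in range(count):
        out = subprocess.run(command, capture_output=True, text=True,
                             check=True).stdout
        with open(log_path, 'a') as f:   # append, never overwrite old data
            f.write(out)


if __name__ == '__main__':
    repeat(int(sys.argv[1]), sys.argv[2], sys.argv[3:])
```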

@gottesmm (Member) commented

@atrick You are correct.

@atrick (Member, Author) commented Apr 26, 2017

Regarding the ability to handle multiple entries in the output per key...
The more general point I should have made is that scripts should be modular and composable. The compare_perf script is much older than our Benchmark_Driver script (before Mishal rewrote it) and I designed it to work with virtually any benchmark driver simply by tweaking the formatting lines at the top. (I use effectively the same script for LLVM results and other benchmark drivers). The fact that Benchmark_Driver does some aggregation of results too is somewhat redundant (I originally intended all aggregation to be in the compare script). But I would prefer not to strip the compare script of its utility. I realize it gets harder to keep the script generic as it gets more sophisticated.

@palimondo (Mannequin) commented Jan 8, 2019

Closing, as this was superseded by @eeckstein's work on run_smoke_bench.

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@AnthonyLatsis AnthonyLatsis added feature A feature request or implementation benchmarks and removed new feature labels Jan 27, 2023
This issue was closed.