
[SR-871] Benchmark runner misreports minimum values #43483

Closed
lorentey opened this issue Mar 3, 2016 · 8 comments
Labels: benchmarks, bug (a deviation from expected or documented behavior; also: expected but undesirable behavior), performance

lorentey commented Mar 3, 2016

Previous ID: SR-871
Radar: None
Original Reporter: @lorentey
Type: Bug
Status: Resolved
Resolution: Done

Attachment: Download

Additional Detail from JIRA
Votes: 0
Component/s: Project Infrastructure
Labels: Bug, Benchmark, Performance
Assignee: @gottesmm
Priority: Medium

md5: c88582f5d81eeaf1671d9f61a060a88f

Issue Description:

While looking at some benchmarking anomalies for PR #1194, I noticed that benchmark results are sometimes logged with minimum values that are far greater than either the maximum or mean.

I've noticed these appearing on stdout before, but I assumed they were an artifact of the code that prints the results, and would not lead to miscalculated results. I was wrong.

I attached an example log containing numerous such mistakes. (I increased the number of samples to 25 to get more stable results; otherwise I haven't touched the benchmarking infrastructure.)

For example, look at the Histogram line in the snippet below:

33,HashTest,25,1700,1785,1731,0,1731,10512138
34,Histogram,25,1023,761,756,0,756,3915940
35,Integrate,25,244,253,248,0,248,3910697

Since benchmark comparisons are based on minimum values only, this bug makes all results produced by the current version of the benchmark suite highly suspect. For example, here is a diff for the snippet above:

 33 HashTest                              1697             1700        +3     +0.2%     1.00x (?)
 34 Histogram                              730             1023      +293    +40.1%     0.71x
 35 Integrate                              242              244        +2     +0.8%     0.99x (?)
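
Anomalous lines like this are easy to flag mechanically. A minimal sketch, assuming the column order shown above is #, TEST, SAMPLES, MIN, MAX, MEAN, SD, MEDIAN, … and a hypothetical log file name:

    import csv

    # Flag any result line whose reported minimum exceeds its maximum.
    # Assumed column order: #, TEST, SAMPLES, MIN, MAX, MEAN, SD, MEDIAN, ...
    with open("benchmark_log.csv") as f:          # hypothetical file name
        for row in csv.reader(f):
            if len(row) < 5 or not (row[3].isdigit() and row[4].isdigit()):
                continue                          # skip headers or malformed lines
            reported_min, reported_max = int(row[3]), int(row[4])
            if reported_min > reported_max:
                print("suspect:", row[1], reported_min, reported_max)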
jckarter commented Mar 3, 2016

Any ideas, Michael?

lorentey commented Mar 3, 2016

For what it's worth, I haven't yet seen such output when I run Benchmark_O directly.
It does occur a handful of times in every Benchmark_Driver-generated log file that I've looked at so far.

gottesmm commented Mar 3, 2016

This is bizarre since the code that calculates the minimum is:

 var samples = [UInt64](count: c.numSamples, repeatedValue: 0)

  let sampler = SampleRunner()
  for s in 0..<c.numSamples {
    let time_per_sample: UInt64 = 1_000_000_000 * UInt64(c.iterationScale)

    var scale : UInt
    var elapsed_time : UInt64 = 0
    // Compute the scaling factor if a fixed c.fixedNumIters is not specified.
    scale = c.fixedNumIters

    // Rerun the test with the computed scale factor.
    elapsed_time = sampler.run(name, fn: fn, num_iters: scale)
    ...
    // save result in microseconds or k-ticks
    samples[s] = elapsed_time / UInt64(scale) / 1000
    ...
  }

  ...

  // Return our benchmark results.
  return BenchResults(delim: c.delim, sampleCount: UInt64(samples.count),
                      min: samples.minElement()!, max: samples.maxElement()!,
                      mean: mean, sd: sd, median: internalMedian(samples))
}

If the min element is greater than the maximum element, then that would suggest that Array.minElement/Array.maxElement are incorrect.

EDITED: Formatted the pasted code.

gottesmm commented Mar 3, 2016

The problem may be at a level higher than the raw Swift driver, though. I will add some asserts that check bounds to the Swift driver. If they don't trigger, it means that a script on top is misinterpreting the output of the low-level driver.

lorentey commented Mar 3, 2016

Gotcha! It's because Benchmark_Driver uses string comparison to calculate the minimum/maximum:

    avg_test_output[NUM_SAMPLES_INDEX] = num_samples
    avg_test_output[MIN_INDEX] = min(test_outputs,
                                     key=lambda x: x[MIN_INDEX])[MIN_INDEX]
    avg_test_output[MAX_INDEX] = max(test_outputs,
                                     key=lambda x: x[MAX_INDEX])[MAX_INDEX]
    avg_test_output = map(str, avg_test_output)

The cases I've seen confirm that a lexical comparison caused them.
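
A quick demonstration of the effect (Python compares strings lexicographically, so "1023" sorts before "761"):

    # String comparison is lexicographic: '1' < '7', so "1023" is picked
    # as the "minimum" -- exactly the Histogram anomaly above.
    samples = ["1023", "761", "756"]
    print(min(samples))                   # '1023' -- wrong
    print(min(int(s) for s in samples))   # 756    -- correct numeric minimum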

lorentey commented Mar 3, 2016

I'll create a pull request in a few moments. (My Python is a bit rusty.)
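
A minimal sketch of one possible fix, coercing the relevant columns to integers before comparing (this mirrors the snippet above; it is not necessarily what the eventual pull request does):

    avg_test_output[NUM_SAMPLES_INDEX] = num_samples
    # Compare the MIN/MAX columns numerically rather than lexically.
    avg_test_output[MIN_INDEX] = min(int(t[MIN_INDEX]) for t in test_outputs)
    avg_test_output[MAX_INDEX] = max(int(t[MAX_INDEX]) for t in test_outputs)
    avg_test_output = map(str, avg_test_output)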

gottesmm commented Mar 3, 2016

Simple unit testing of the script would have caught this bug. I filed:

https://bugs.swift.org/browse/SR-873

lorentey commented Mar 3, 2016

Submitted a fix as PR #1530.

@swift-ci transferred this issue from apple/swift-issues on Apr 25, 2022
This issue was closed.