
[SR-1220] LLDB test suite lost ability to pass test suite when reruns clear some issues #4411

Closed
trfiala mannequin opened this issue Apr 13, 2016 · 15 comments
trfiala mannequin commented Apr 13, 2016

Previous ID: SR-1220
Radar: None
Original Reporter: @trfiala
Type: Bug
Status: Closed
Resolution: Done

Additional Detail from JIRA
Votes: 0
Component/s: LLDB for Swift
Labels: Bug
Assignee: @trfiala
Priority: Medium

md5: 7e719234006d1a19a1c212ead98ea94d

Issue Description:

I am not sure exactly what is broken here, but something in the '--rerun-all-issues' option and the rerun logic that clears test issues has broken. I've seen several builds where an errant test timed out, was rerun, the rerun succeeded, and the test suite still failed.

I suspect something changed upstream and this breakage came in when I brought in LLVM.org LLDB svn trunk through r265422 into GitHub swift-lldb.

This is high priority, as it is causing builds to be flagged as failing when we don't intend them to be.


trfiala mannequin commented Apr 13, 2016

I'm going to see if I can get some tests around the test infrastructure itself so we can guard against this feature breaking. Our CI depends heavily on it to avoid generating noise (i.e. false positives).


trfiala mannequin commented Apr 13, 2016

This is one example of it:
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2753/

The symptom looks like this:

1. The test log will show the original failure (usually a timeout or a load-sensitive test failing, like TestConcurrentEvents.py or the lldb-mi ones).
2. The test will rerun in the trailing rerun phase.
3. The rerun will clearly pass.
4. The final summarization will continue to indicate there was a failure.
5. We will then fail the test run.

For this example, TestConcurrentEvents.py failed (step 1). It was rerun (step 2) and passed without any further timeout (step 3). The summarization still lists the TIMEOUT at the end of the list; that is the first sign of an issue, since it should have been cleared (step 4). Step 5 likely falls out from the same problem that caused step 4 to print incorrectly.

Once I figure out what is broken, I'll see if it is also broken in LLVM.org LLDB svn trunk. It is possible this is just a manifestation of a broken merge. But it might also exist at LLVM.org.


trfiala mannequin commented Apr 13, 2016

I'm definitely getting unexpected behavior on a forced timeout test (packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py).

When run with:
LLDB_TEST_TIMEOUT=5s test/dotest.py --executable=`xcrun -find lldb` --test-subdir issue_verification

I'm seeing it oscillate between passing and timing out, where the timeout case is not even generating a timeout event. Tracking that down now.


trfiala mannequin commented Apr 13, 2016

This is looking like a race in test inferior status collection.

The test infrastructure sends events as the primary means of communicating what the test runner's many test queues are doing. When a test inferior completes, we do a wrap-up on the inferior's status to generate synthetic events like "timeout" and others that, for whatever reason, cannot be generated by the inferior itself. This also covers meta events like job queue events and similar bits.

If I introduce a small delay in dosep.send_inferior_post_run_events(), then all the issues I see go away. This implies a race in the test runner: the run strategy knows the inferior is done, but we haven't yet received the process end state, so we don't know that it timed out, and we skip the timeout cleanup even though it was a timeout. Inserting a one-second delay there yields correct results 10 out of 10 times.

I'll need to diagnose the race and put in something to prevent it.
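
For illustration, here's a minimal sketch of the kind of race I'm describing; the names are invented, not the actual dosep.py internals. The wrap-up consults whatever status has arrived at the moment the runner notices the inferior is done, so an in-flight "timeout" event gets missed:

```python
import threading
import time

status_events = []               # status delivered asynchronously by the inferior
status_lock = threading.Lock()

def deliver_status_event(event, delay):
    """Simulates the inferior's end state arriving over the event channel."""
    time.sleep(delay)
    with status_lock:
        status_events.append(event)

def wrap_up_inferior():
    """Buggy pattern: consult whatever status has arrived right now."""
    with status_lock:
        timed_out = "timeout" in status_events
    if timed_out:
        print("running timeout cleanup")
    else:
        print("no timeout seen; skipping cleanup")   # race: event still in flight

t = threading.Thread(target=deliver_status_event, args=("timeout", 0.1))
t.start()
wrap_up_inferior()   # usually runs before the event lands -> cleanup skipped
t.join()
```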


trfiala mannequin commented Apr 13, 2016

I'm going to temporarily bump up the timeouts while we work this out, to avoid hitting the timeouts in the first place. This is a reasonable short-term workaround while I investigate and address the underlying issue, assuming the timeouts are not actually deadlocks; if they are deadlocks, no amount of extra time will help them pass in the first round. We should look into capturing a 'sample' (OS X) or the Linux equivalent when we hit a test suite timeout, so we can see whether these timeouts are due to something more nefarious than just taking longer under heavy load.
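
As a sketch of that diagnostic idea (tool invocations are illustrative, not something our harness currently does), something like this could capture a snapshot of a hung inferior before we kill it:

```python
import subprocess
import sys

def snapshot_process(pid):
    """Capture a 3-second sample (OS X) or all-thread backtraces (Linux)."""
    if sys.platform == "darwin":
        cmd = ["sample", str(pid), "3"]
    else:
        cmd = ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
    return subprocess.check_output(cmd)
```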


trfiala mannequin commented Apr 13, 2016

This is most definitely not only affecting timeout results. It also affects FAIL/ERROR and other categories, as witnessed here:

https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2765
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-15_10/3115

I'm following a lead that has potential. The test inferiors and the test inferior worker threads send test events through a socket whose listeners run as asyncore sockets. All test events must relay through the listening socket in the main dosep.py process, which then forwards them to the final top-level ResultsFormatter. It appears that a rerun phase with a small number of reruns can somehow race and complete (and thus report to the ResultsFormatter) after the summarization step has already taken place, which means those test events never get the chance to overwrite the old failing ones.

I'm investigating that lead and looking to see how I can ensure all events are fully drained before doing the summarization. If this ends up being the issue, the next question would be what is causing this to show up now?
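
A minimal sketch of the fix I have in mind, with invented names (the real channel is an asyncore socket, not a queue.Queue): summarization must not start until every producer has signalled that its event stream is complete.

```python
import queue

def drain_and_summarize(event_queue, expected_producers, formatter):
    """Forward all in-flight test events before printing the summary."""
    done = 0
    while done < expected_producers:
        event = event_queue.get()            # block until the next test event
        if event["type"] == "producer-done":
            done += 1                        # one producer fully drained
        else:
            formatter.handle_event(event)    # rerun results can still overwrite
    formatter.print_summary()                # safe: nothing left in flight
```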


trfiala mannequin commented Apr 14, 2016

I think I've got at least one of the races understood now.

When we're in the rerun phase, we only run one test at a time, intentionally. This is not specific to the rerun logic, but that is a place that is likely to show up.

The run loop exits when there are no worker threads active and no receiving worker sockets (asyncore-based) or listener still running; i.e., we're done when we think we're done. Except...

This is the sequence I think is failing (which I've been debugging):

1. A single worker queue is running (as is the case in the rerun logic).
2. The test inferior or worker queue queues up a connection and message to the listener socket.
3. The worker completes and goes away before the listener gets an asyncore cycle to process the waiting connection from the test inferior or the worker queue's job-level messages.
4. Because the worker queue is done, the check that says "kill the listener if there are no more worker queues" fires. Now the listener socket never gets going, and we exit the main loop without having received the test event data we really need.

This would explain what I'm seeing. I'm going to figure out how to solve that.
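
In pseudo-Python (names invented, not the dosep.py code), the exit test effectively looks like this, and a pending, not-yet-accepted connection is invisible to both checks:

```python
def should_exit_run_loop(active_workers, asyncore_socket_map):
    # Buggy: a worker can finish right after handing off a connection that the
    # listener has not yet accepted, so neither collection reflects it.
    return len(active_workers) == 0 and len(asyncore_socket_map) == 0
```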


trfiala mannequin commented Apr 14, 2016

If this is indeed the issue, having the socket-initiating side wait for an ack on the read side before it does anything else should be sufficient. That ensures the producer of the test events (the test inferior or the worker thread) stays alive, and therefore keeps the run loop alive, until the listening dosep.py socket is in the asyncore IO map.

I'm going to give that a shot.

This doesn't look like a new problem in the code. It might have been hidden because we were always running slower, but the new merge may have brought in enough speed improvements to change the probability of the race being hit.
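
For illustration, the producer side of the ack handshake might look like the following; this is a sketch assuming a plain TCP event channel, not the exact dosep.py code. The listener would send the ack byte only after registering the accepted socket in its asyncore IO map.

```python
import socket

ACK = b"*"

def send_event_with_ack(host, port, payload):
    """Connect, wait for the listener's ack, then send the test event."""
    sock = socket.create_connection((host, port))
    try:
        # Block until the listener confirms the socket is registered; until
        # then, this producer (and therefore the run loop) stays alive.
        if sock.recv(1) != ACK:
            raise RuntimeError("listener did not acknowledge the connection")
        sock.sendall(payload)
    finally:
        sock.close()
```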


trfiala mannequin commented Apr 14, 2016

My ack mechanism appears to be working fine. I'm going to get it into a PR and test it on the Linux builders; the nature of the race makes it much more likely to occur on high-speed hardware. This will also need to go up to LLVM.org once I prove that it solves the issue.

It's conceivable there's something else there lurking, but I'm hopeful this is the primary issue.


trfiala mannequin commented Apr 14, 2016

I put up this PR:
apple/swift-lldb#19

Ubuntu 14.04 passed without any issues showing up. We need a rerun to really test it, though.

Ubuntu 15.10 is just wrapping up and... it is a fail. Build #3130 on Ubuntu 15.10 (master) had a single ERROR, did the rerun, the rerun passed, but we still reported an ERROR in the summary and failed the test suite.

I'll need to repro the failing ERROR case. I did all my work with the timeouts, which definitely had a race that I've now patched. Looks like we have at least one more issue.


trfiala mannequin commented Apr 14, 2016

I'm able to repro a case where not all of the failures clear. It is with timeouts again, but it shows another error:

PYTHONUNBUFFERED=1 LLDB_TEST_TIMEOUT=3 /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c

Rerunning the following files:
lang/c/anonymous/TestAnonymous.py
lang/c/step-target/TestStepTarget.py
lang/c/function_types/TestFunctionTypes.py
lang/c/array_types/TestArrayTypes.py
lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py
lang/c/bitfields/TestBitfields.py
lang/c/shared_lib/TestSharedLib.py
rerun will use the 'threading' test runner strategy

1 out of 7 test suites processed - TestSharedLibStrippedSymbols.py
2 out of 7 test suites processed - TestSharedLib.py
3 out of 7 test suites processed - TestBitfields.py
4 out of 7 test suites processed - TestArrayTypes.py
5 out of 7 test suites processed - TestFunctionTypes.py

Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

Configuration: arch=x86_64 compiler=clang

Collected 14 tests

Cannot read termcap database;
using dumb terminal settings.

[TestAnonymous.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

6 out of 7 test suites processed - TestAnonymous.py

Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

Configuration: arch=x86_64 compiler=clang

Collected 10 tests

Cannot read termcap database;
using dumb terminal settings.

[TestStepTarget.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

7 out of 7 test suites processed - TestStepTarget.py
Test rerun complete

=============
Issue Details

TIMEOUT: test_and_run_command_dwarf (lang/c/bitfields/TestBitfields.py)
TIMEOUT: test_and_run_command_dwarf (lang/c/array_types/TestArrayTypes.py)
TIMEOUT: test_expr_child_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_expr_grandchild_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib/TestSharedLib.py)
TIMEOUT: test_pointers_dsym (lang/c/function_types/TestFunctionTypes.py)
TIMEOUT: test_with_end_line_bad_name_dsym (lang/c/step-target/TestStepTarget.py)
TIMEOUT: test_with_end_line_bad_name_dwarf (lang/c/step-target/TestStepTarget.py)

===================
Test Result Summary

Test Methods: 71
Reruns: 21
Success: 49
Expected Failure: 13
Failure: 0
Error: 0
Exceptional Exit: 0
Unexpected Success: 0
Skip: 0
Timeout: 9
Expected Timeout: 0

Process finished with exit code 1

Only TestAnonymous.py and TestStepTarget.py still had failures; the rest came out clean, yet we still recorded them all as failures. It looks like five files should have had their errors cleared, but that didn't happen.


trfiala mannequin commented Apr 17, 2016

Coming back to this now that I'm back.


trfiala mannequin commented Apr 17, 2016

Looks like I found the issue with the timeouts. The timeout code generates filenames ending in .py, while the success-reporting code reports .pyc. I don't think the old path used to report .pyc; it reported .py. This is probably where the issue came from: something changed that refactored how we get the filename for the currently running test, and that broke the collation logic for test results.

Working on a fix to normalize filenames now.

I'm also going to have a look at what changed in filename retrieval, since we should be using the same underlying mechanism for both.
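
The normalization I have in mind is along these lines (the helper name is hypothetical): strip the compiled-module suffix before comparing filenames from different event sources, so the collation keys agree.

```python
import os

def normalize_test_filename(path):
    """Map 'TestFoo.pyc'/'TestFoo.pyo' back to 'TestFoo.py' for collation."""
    base, ext = os.path.splitext(path)
    if ext in (".pyc", ".pyo"):
        return base + ".py"
    return path

# With this, the timeout record and the success record collate to the same key:
assert normalize_test_filename("lang/c/anonymous/TestAnonymous.pyc") == \
       "lang/c/anonymous/TestAnonymous.py"
```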


trfiala mannequin commented Apr 17, 2016

I tracked this down and committed the fix here:
apple/swift-lldb#20


trfiala mannequin commented Apr 17, 2016

Fixed here:
apple/swift-lldb@25c5f20

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@shahmishal shahmishal transferred this issue from apple/swift May 5, 2022
This issue was closed.