[SR-1220] LLDB test suite lost ability to pass test suite when reruns clear some issues #4411
I'm going to see if I can get some tests around the test infrastructure itself so we can guard against this feature breaking. Our CI depends heavily on it to avoid generating noise (i.e. false positives).
This is one example of it. The symptom looks like this:

1. The test log shows the original failure, usually a timeout or a load-sensitive test failing (like TestConcurrentEvents.py or the lldb-mi ones). For this example, TestConcurrentExamples.py failed.
2. The test was rerun.
3. The rerun passed without any further timeout.
4. The summarization still lists the TIMEOUT at the end of the list. That is the first sign of an issue: it should have been cleared.
5. The suite is then flagged as failing, which likely falls out from the issue that caused step 4 to print incorrectly.

Once I figure out what is broken, I'll see if it is also broken in LLVM.org LLDB svn trunk. It is possible this is just a manifestation of a broken merge, but it might also exist at LLVM.org.
I'm definitely getting unexpected behavior on a forced-timeout test (packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py). When run with: I'm seeing it oscillate between passing and timing out, and in the timeout case it is not even generating a timeout event. Tracking that down now.
This is looking like a race in test inferior status collection. The test infrastructure sends events as the primary means of communicating what the test runner's many test queues are doing. When a test inferior completes, we do a wrap-up on the inferior's status to generate synthetic events like "timeout" and others that, for whatever reason, cannot be generated by the inferior itself. This also includes meta events like job queue events and similar bits. If I introduce a small delay in dosep.send_inferior_post_run_events(), all the issues I see go away. This would appear to imply there is a race in the test runner: the run strategy knows the inferior is done, but we haven't yet received the process end state, so we don't know that it timed out. We thus skip the timeout cleanup, even though it was a timeout. Inserting a second delay there yields correct results 10 out of 10 times. I'll need to diagnose the race and put in something to prevent it.
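To make the race concrete, here is a minimal sketch (not the real dosep code; the event tuples, helper names, and delays are all hypothetical) of how a collector that stops as soon as the queue looks empty can miss the lagging post-run status event, while one that waits for it sees the timeout:

```python
import queue
import threading

def collect(events, wait):
    """Drain events until the post-run event arrives or `wait` seconds
    pass with nothing queued. A racy collector uses a tiny wait."""
    seen = []
    while True:
        try:
            evt = events.get(timeout=wait)
        except queue.Empty:
            break
        seen.append(evt)
        if evt[0] == "post_run":
            break
    return seen

def run_once(wait):
    events = queue.Queue()
    # The result event arrives promptly...
    events.put(("result", "TestExample.py", "TIMEOUT"))
    # ...but the synthetic post-run status event lags behind, as when
    # send_inferior_post_run_events() runs late relative to the runner.
    t = threading.Timer(
        0.2, events.put, args=[("post_run", "TestExample.py", "timed_out")])
    t.start()
    seen = collect(events, wait)
    t.join()
    return seen

racy_events = run_once(wait=0.01)    # gives up before the post-run event lands
patient_events = run_once(wait=2.0)  # waits long enough to see the timeout
```

With the tiny wait the runner "knows" the inferior is done but never sees the timed-out status, which matches the skipped timeout cleanup described above; the small delay that made the issues go away is effectively the longer wait.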
I'm going to temporarily bump up the timeouts while we work this out, to avoid the timeouts in the first place. This is a reasonable short-term workaround while I investigate and address the underlying issue, provided the timeouts are not actually deadlocks. If they're deadlocks, no amount of time will help them pass in the first round. We should look into adding a 'sample' (OS X), or the Linux equivalent, when we hit a test suite timeout so we can see whether these timeouts are due to something more nefarious than just taking longer under heavy load.
This is most definitely not only affecting timeout results. It also affects FAIL/ERROR and other categories, as witnessed here: https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2765

I'm following a lead that has potential. The test inferiors and the test inferior worker threads send test events through a socket whose listeners run as asyncore sockets. All test events must relay through the listening socket in the main dosep.py process, which then forwards them to the final top-level ResultsFormatter. It appears that a rerun phase with a small number of reruns can now somehow race and complete (and thus report to the ResultsFormatter) after the summarization step takes place. This means those test events never get the opportunity to overwrite the old failing ones. I'm investigating that lead and looking at how I can ensure all events are fully drained before doing the summarization. If this ends up being the issue, the next question is what is causing it to show up now.
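The ordering problem can be illustrated with a toy stand-in for the formatter (this is not the real lldbsuite ResultsFormatter; the class and statuses here are simplified assumptions). The latest event for a test file is meant to win, so a rerun's PASS should overwrite the original failure — unless summarization happens before the rerun event is drained:

```python
class ResultsFormatter:
    """Toy sketch: keeps only the most recent status per test file."""

    def __init__(self):
        self.results = {}

    def handle_event(self, test_file, status):
        # Later events for the same file overwrite earlier ones.
        self.results[test_file] = status

    def summarize(self):
        # Report every file whose latest status is not a pass.
        return sorted(f for f, s in self.results.items() if s != "PASS")

pending = [("TestConcurrentEvents.py", "TIMEOUT"),
           ("TestConcurrentEvents.py", "PASS")]  # the rerun's result

# Racy order: summarize while the rerun event is still in flight.
racy = ResultsFormatter()
racy.handle_event(*pending[0])
failures_too_early = racy.summarize()  # stale TIMEOUT still reported

# Fixed order: drain every pending event, then summarize.
fixed = ResultsFormatter()
for evt in pending:
    fixed.handle_event(*evt)
failures_after_drain = fixed.summarize()
```

The fix being chased above is exactly forcing the second ordering: no summarization until the relay sockets are fully drained.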
I think I've got at least one of the races understood now. When we're in the rerun phase, we intentionally only run one test at a time. This is not specific to the rerun logic, but that is a place where it is likely to show up. The run loop will exit when there are no worker threads active and no receiving worker sockets (asyncore-based) registered with the listener; i.e., we're done when we think we're done. Except... This is the sequence I think is failing (which I've been debugging):

1. Single worker queue running (as is the case in the rerun logic).

This would explain what I'm seeing. I'm going to figure out how to solve that.
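A sketch of that exit predicate, under my reading of the description above (the names here are illustrative, not the actual dosep.py identifiers): the loop exits when the active-worker set and the asyncore IO map are both empty, so with a single worker queue the worker can finish before its event socket ever lands in the map, and the loop exits with events still undelivered.

```python
def run_loop_should_exit(active_workers, io_map):
    """Done when no worker threads are active and no listener
    sockets remain in the (asyncore-style) IO map."""
    return not active_workers and not io_map

# The single-worker rerun case: the worker has already been removed
# from the active set, but its event socket was never registered...
active_workers, io_map = set(), {}
early_exit = run_loop_should_exit(active_workers, io_map)  # True: exits early

# ...whereas with the socket registered first, the loop stays alive
# until the rerun's events are fully drained.
io_map = {4: "rerun-event-socket"}
stays_alive = not run_loop_should_exit(active_workers, io_map)
```

This is why the window only needs to be between "worker done" and "socket registered" — a tiny gap, but one that fast hardware can hit.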
If this is indeed the issue, having the socket-initiating side wait for an ack on the read side before it does anything else should be sufficient. That ensures the producer of the test events (the test inferior or the worker thread) stays alive until the listening dosep.py socket is in the asyncore IO map, which in turn keeps the run loop alive. I'm going to give that a shot. This doesn't look like a new problem in the code. It may have been hidden because we were always running slower; the new merge may have brought in enough speed improvements to change the probability of hitting the race.
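The handshake idea can be sketched with plain sockets (a toy model, not the actual asyncore-based dosep.py code; the one-byte ack and the `registered` list standing in for the IO map are assumptions): the producer connects but sends nothing until it reads the ack, which the listener only sends after registering the connection.

```python
import socket
import threading

def listener(server_sock, registered, received):
    """Toy dosep.py side: accept, register the channel, ack, then read."""
    conn, _ = server_sock.accept()
    registered.append(conn)   # models adding the socket to the asyncore IO map
    conn.sendall(b"*")        # the ack: "I'm registered; the loop will wait"
    received.append(conn.recv(1024))  # test events can now safely flow
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))     # ephemeral port for the sketch
server.listen(1)
port = server.getsockname()[1]

registered, received = [], []
t = threading.Thread(target=listener, args=(server, registered, received))
t.start()

# Producer (test inferior / worker thread) side:
producer = socket.create_connection(("127.0.0.1", port))
ack = producer.recv(1)            # block until the listener is registered
producer.sendall(b"test-event")   # only emit events after the ack
producer.close()
t.join()
server.close()
```

The key property is that by the time any event is sent, the listener is guaranteed to be in the IO map, so the run loop cannot conclude it is done while events are still in flight.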
My ack mechanism appears to be working fine. I'm going to get it into a PR and test it on the Linux builders. The nature of the race makes it much more likely to occur on high-speed hardware. This will also need to go up to LLVM.org once I prove it solves the issue. It's conceivable something else is lurking, but I'm hopeful this is the primary issue.
I put up this PR: Ubuntu 14.04 passed without any issues showing up, though we need a rerun to really test it. Ubuntu 15.10 is just wrapping up and... it is a fail. Build #3130 on Ubuntu 15.10 (master) had a single ERROR, did the rerun, and the rerun passed, but we still reported an ERROR in the summary and failed the test suite. I'll need to repro the failing ERROR case. I did all my work with the timeouts, which definitely had a race that I patched. Looks like we have at least one more issue.
I'm able to repro a case of not all the failures clearing. It is with timeouts again, but it shows another error:

```
PYTHONUNBUFFERED=1 LLDB_TEST_TIMEOUT=3 /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c
```

```
Rerunning the following files:
1 out of 7 test suites processed - TestSharedLibStrippedSymbols.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Configuration: arch=x86_64 compiler=clang
Collected 14 tests
Cannot read termcap database;
[TestAnonymous.py FAILED] (TIMEOUT)
6 out of 7 test suites processed - TestAnonymous.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Configuration: arch=x86_64 compiler=clang
Collected 10 tests
Cannot read termcap database;
[TestStepTarget.py FAILED] (TIMEOUT)
7 out of 7 test suites processed - TestStepTarget.py
=============
```
Coming back to this now that I'm back.
Looks like I found the issue with timeouts. The timeout code is generating filenames ending in .py, while the success path reports .pyc. I don't think the old path used to report .pyc; it reported .py. This is probably where the issue came from: something else changed that refactored how we get the filename of the currently running test, and that broke the collation logic of test results. Working on a fix to normalize filenames now. I'm also going to have a look at what changed in filename retrieval, since we should be using the same underlying mechanism for both.
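The normalization fix amounts to collating compiled-module names back to their .py source so a rerun's PASS keys to the same entry as the original TIMEOUT. A minimal sketch (the helper name is hypothetical, not the committed function):

```python
import os

def normalize_test_filename(path):
    """Map compiled-module suffixes (.pyc/.pyo) back to the .py source
    so both the timeout path and the success path key results the
    same way."""
    root, ext = os.path.splitext(path)
    if ext in (".pyc", ".pyo"):
        return root + ".py"
    return path

# The mismatch described above: TIMEOUT recorded under the .py name,
# the rerun's success reported under .pyc -- normalized, they collide
# and the PASS can overwrite the stale TIMEOUT.
timeout_key = normalize_test_filename("TestAnonymous.py")
success_key = normalize_test_filename("TestAnonymous.pyc")
```

Without the normalization, the two statuses live under different keys and the stale TIMEOUT survives into the summary, which is exactly the symptom seen on the builders.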
I tracked this down and committed the fix here: |
Fixed here: |
Additional Detail from JIRA
md5: 7e719234006d1a19a1c212ead98ea94d
Issue Description:
I am not sure exactly what is broken here, but something about '--rerun-all-issues' and the rerun logic that clears test issues has broken. I've seen several builds where the errant test timed out and was rerun, the rerun succeeded, and then the test suite still failed.
I suspect something changed upstream and this breakage came in when I brought in LLVM.org LLDB svn trunk through r265422 into GitHub swift-lldb.
This is high priority, as it is causing builds to be flagged as failing when we don't intend them to be.