[SR-1220] LLDB test suite lost ability to pass test suite when reruns clear some issues #4411
I'm going to see if I can get some tests around the test infrastructure itself so we can guard against this feature breaking. Our CI depends heavily on it to avoid generating noise (i.e. false positives).
This is one example of it. The symptom looks like this:

1. The test log shows the original failure, usually a timeout or a load-sensitive test failing (like TestConcurrentEvents.py or the lldb-mi ones). For this example, TestConcurrentExamples.py failed.
2. The test was rerun.
3. The rerun passed without any further timeout.
4. The summarization still lists the TIMEOUT at the end of the list. That is the first sign of an issue: it should have been cleared.
5. The suite is then flagged as failing, which likely falls out from the issue that caused step 4 to print incorrectly.

Once I figure out what is broken, I'll see if it is also broken in LLVM.org LLDB svn trunk. It is possible this is just a manifestation of a broken merge, but it might also exist at LLVM.org.
I'm definitely getting unexpected behavior on a forced-timeout test (packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py). When run with: I'm seeing it oscillate between passing and timing out, and in the timeout case it is not even generating a timeout event. Tracking that down now.
This is looking like a race in test inferior status collection. The test infrastructure sends events as the primary means of communicating what the test runner's many test queues are doing. When a test inferior completes, we do a wrap-up on the inferior's status to generate synthetic events like "timeout" and others that, for whatever reason, cannot be generated by the inferior itself. This also includes meta events like job queue events and similar bits. If I introduce a small delay in dosep.send_inferior_post_run_events(), all the issues I see go away. This would appear to imply there is a race in the test runner: the run strategy knows the inferior is done, but we haven't yet received the process end state, so we don't know that it timed out. We thus skip the timeout cleanup, even though it was a timeout. Inserting a second delay there yields correct results 10 out of 10 times. I'll need to diagnose the race and put in something to prevent it.
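To make the race concrete, here is a minimal sketch (not the real dosep code; the event tuples, helper names, and delays are all hypothetical) of how a collector that stops as soon as the queue looks empty can miss the lagging post-run status event, while one that waits for it sees the timeout:

```python
import queue
import threading

def collect(events, wait):
    """Drain events until the post-run event arrives or `wait` seconds
    pass with nothing queued. A racy collector uses a tiny wait."""
    seen = []
    while True:
        try:
            evt = events.get(timeout=wait)
        except queue.Empty:
            break
        seen.append(evt)
        if evt[0] == "post_run":
            break
    return seen

def run_once(wait):
    events = queue.Queue()
    # The result event arrives promptly...
    events.put(("result", "TestExample.py", "TIMEOUT"))
    # ...but the synthetic post-run status event lags behind, as when
    # send_inferior_post_run_events() runs late relative to the runner.
    t = threading.Timer(
        0.2, events.put, args=[("post_run", "TestExample.py", "timed_out")])
    t.start()
    seen = collect(events, wait)
    t.join()
    return seen

racy_events = run_once(wait=0.01)    # gives up before the post-run event lands
patient_events = run_once(wait=2.0)  # waits long enough to see the timeout
```

With the tiny wait the runner "knows" the inferior is done but never sees the timed-out status, which matches the skipped timeout cleanup described above; the small delay that made the issues go away is effectively the longer wait.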
I'm going to temporarily bump up the timeouts while we work this out, to avoid the timeouts in the first place. This is a reasonable short-term workaround while I investigate and address the underlying issue, provided the timeouts are not actually deadlocks. If they're deadlocks, no amount of time will help them pass in the first round. We should look into adding a 'sample' (OS X), or the Linux equivalent, when we hit a test suite timeout so we can see whether these timeouts are due to something more nefarious than just taking longer under heavy load.
This is most definitely not only affecting timeout results. It also affects FAIL/ERROR and other categories, as witnessed here: https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2765

I'm following a lead that has potential. The test inferiors and the test inferior worker threads send test events through a socket whose listeners run as asyncore sockets. All test events must relay through the listening socket in the main dosep.py process, which then forwards them to the final top-level ResultsFormatter. It appears that a rerun phase with a small number of reruns can now somehow race and complete (and thus report to the ResultsFormatter) after the summarization step takes place. This means those test events never get the opportunity to overwrite the old failing ones. I'm investigating that lead and looking at how I can ensure all events are fully drained before doing the summarization. If this ends up being the issue, the next question is what is causing it to show up now.
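The ordering problem can be illustrated with a toy stand-in for the formatter (this is not the real lldbsuite ResultsFormatter; the class and statuses here are simplified assumptions). The latest event for a test file is meant to win, so a rerun's PASS should overwrite the original failure — unless summarization happens before the rerun event is drained:

```python
class ResultsFormatter:
    """Toy sketch: keeps only the most recent status per test file."""

    def __init__(self):
        self.results = {}

    def handle_event(self, test_file, status):
        # Later events for the same file overwrite earlier ones.
        self.results[test_file] = status

    def summarize(self):
        # Report every file whose latest status is not a pass.
        return sorted(f for f, s in self.results.items() if s != "PASS")

pending = [("TestConcurrentEvents.py", "TIMEOUT"),
           ("TestConcurrentEvents.py", "PASS")]  # the rerun's result

# Racy order: summarize while the rerun event is still in flight.
racy = ResultsFormatter()
racy.handle_event(*pending[0])
failures_too_early = racy.summarize()  # stale TIMEOUT still reported

# Fixed order: drain every pending event, then summarize.
fixed = ResultsFormatter()
for evt in pending:
    fixed.handle_event(*evt)
failures_after_drain = fixed.summarize()
```

The fix being chased above is exactly forcing the second ordering: no summarization until the relay sockets are fully drained.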
I think I've got at least one of the races understood now. When we're in the rerun phase, we intentionally only run one test at a time. This is not specific to the rerun logic, but that is a place where it is likely to show up. The run loop will exit when there are no worker threads active and no receiving worker sockets (asyncore-based) registered with the listener; i.e., we're done when we think we're done. Except... This is the sequence I think is failing (which I've been debugging):

1. Single worker queue running (as is the case in the rerun logic).

This would explain what I'm seeing. I'm going to figure out how to solve that.
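A sketch of that exit predicate, under my reading of the description above (the names here are illustrative, not the actual dosep.py identifiers): the loop exits when the active-worker set and the asyncore IO map are both empty, so with a single worker queue the worker can finish before its event socket ever lands in the map, and the loop exits with events still undelivered.

```python
def run_loop_should_exit(active_workers, io_map):
    """Done when no worker threads are active and no listener
    sockets remain in the (asyncore-style) IO map."""
    return not active_workers and not io_map

# The single-worker rerun case: the worker has already been removed
# from the active set, but its event socket was never registered...
active_workers, io_map = set(), {}
early_exit = run_loop_should_exit(active_workers, io_map)  # True: exits early

# ...whereas with the socket registered first, the loop stays alive
# until the rerun's events are fully drained.
io_map = {4: "rerun-event-socket"}
stays_alive = not run_loop_should_exit(active_workers, io_map)
```

This is why the window only needs to be between "worker done" and "socket registered" — a tiny gap, but one that fast hardware can hit.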
If this is indeed the issue, having the socket-initiating side wait for an ack on the read side before it does anything else should be sufficient. That ensures the producer of the test events (the test inferior or the worker thread) stays alive until the listening dosep.py socket is in the asyncore IO map, which in turn keeps the run loop alive. I'm going to give that a shot. This doesn't look like a new problem in the code. It may have been hidden because we were always running slower; the new merge may have brought in enough speed improvements to change the probability of hitting the race.
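The handshake idea can be sketched with plain sockets (a toy model, not the actual asyncore-based dosep.py code; the one-byte ack and the `registered` list standing in for the IO map are assumptions): the producer connects but sends nothing until it reads the ack, which the listener only sends after registering the connection.

```python
import socket
import threading

def listener(server_sock, registered, received):
    """Toy dosep.py side: accept, register the channel, ack, then read."""
    conn, _ = server_sock.accept()
    registered.append(conn)   # models adding the socket to the asyncore IO map
    conn.sendall(b"*")        # the ack: "I'm registered; the loop will wait"
    received.append(conn.recv(1024))  # test events can now safely flow
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))     # ephemeral port for the sketch
server.listen(1)
port = server.getsockname()[1]

registered, received = [], []
t = threading.Thread(target=listener, args=(server, registered, received))
t.start()

# Producer (test inferior / worker thread) side:
producer = socket.create_connection(("127.0.0.1", port))
ack = producer.recv(1)            # block until the listener is registered
producer.sendall(b"test-event")   # only emit events after the ack
producer.close()
t.join()
server.close()
```

The key property is that by the time any event is sent, the listener is guaranteed to be in the IO map, so the run loop cannot conclude it is done while events are still in flight.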
My ack mechanism appears to be working fine. I'm going to get it into a PR and test it on the Linux builders. The nature of the race makes it much more likely to occur on high-speed hardware. This will also need to go up to LLVM.org once I prove it solves the issue. It's conceivable something else is lurking, but I'm hopeful this is the primary issue.
I put up this PR: Ubuntu 14.04 passed without any issues showing up, though we need a rerun to really test it. Ubuntu 15.10 is just wrapping up and... it is a fail. Build #3130 on Ubuntu 15.10 (master) had a single ERROR, did the rerun, and the rerun passed, but we still reported an ERROR in the summary and failed the test suite. I'll need to repro the failing ERROR case. I did all my work with the timeouts, which definitely had a race that I patched. Looks like we have at least one more issue.
I'm able to repro a case of not all the failures clearing. It is with timeouts again, but it shows another error:

```
PYTHONUNBUFFERED=1 LLDB_TEST_TIMEOUT=3 /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c
```

```
Rerunning the following files:
1 out of 7 test suites processed - TestSharedLibStrippedSymbols.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Configuration: arch=x86_64 compiler=clang
Collected 14 tests
Cannot read termcap database;
[TestAnonymous.py FAILED] (TIMEOUT)
6 out of 7 test suites processed - TestAnonymous.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Configuration: arch=x86_64 compiler=clang
Collected 10 tests
Cannot read termcap database;
[TestStepTarget.py FAILED] (TIMEOUT)
7 out of 7 test suites processed - TestStepTarget.py
=============
```
Coming back to this now that I'm back.
Looks like I found the issue with timeouts. The timeout code is generating filenames ending in .py, while the success path reports .pyc. I don't think the old path used to report .pyc; it reported .py. This is probably where the issue came from: something else changed that refactored how we get the filename of the currently running test, and that broke the collation logic of test results. Working on a fix to normalize filenames now. I'm also going to have a look at what changed in filename retrieval, since we should be using the same underlying mechanism for both.
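The normalization fix amounts to collating compiled-module names back to their .py source so a rerun's PASS keys to the same entry as the original TIMEOUT. A minimal sketch (the helper name is hypothetical, not the committed function):

```python
import os

def normalize_test_filename(path):
    """Map compiled-module suffixes (.pyc/.pyo) back to the .py source
    so both the timeout path and the success path key results the
    same way."""
    root, ext = os.path.splitext(path)
    if ext in (".pyc", ".pyo"):
        return root + ".py"
    return path

# The mismatch described above: TIMEOUT recorded under the .py name,
# the rerun's success reported under .pyc -- normalized, they collide
# and the PASS can overwrite the stale TIMEOUT.
timeout_key = normalize_test_filename("TestAnonymous.py")
success_key = normalize_test_filename("TestAnonymous.pyc")
```

Without the normalization, the two statuses live under different keys and the stale TIMEOUT survives into the summary, which is exactly the symptom seen on the builders.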
I tracked this down and committed the fix here: |
Fixed here: |
Additional Detail from JIRA
md5: 7e719234006d1a19a1c212ead98ea94d
Issue Description:
I am not sure exactly what is broken here, but something about '--rerun-all-issues' and the rerun logic that clears test issues has broken. I've seen several builds where the errant test timed out and was rerun, the rerun succeeded, and then the test suite still failed.
I suspect something changed upstream and this breakage came in when I brought in LLVM.org LLDB svn trunk through r265422 into GitHub swift-lldb.
This is high priority, as it is causing builds to be flagged as failing when we don't intend them to be.