[SR-4567] Race condition in URLSession causes a hang #3876

Closed
pushkarnk opened this issue Apr 12, 2017 · 3 comments

@pushkarnk
Collaborator

Previous ID: SR-4567
Radar: None
Original Reporter: @pushkarnk
Type: Bug
Status: Closed
Resolution: Done

Attachment: Download

Additional Detail from JIRA
Votes: 0
Component/s: Foundation
Labels: Bug
Assignee: @pushkarnk
Priority: Medium

md5: 6d5db0ffb21f312b377c92601469a6e1

Issue Description:

The attached test app, which makes thousands of API calls to the GitHub API, eventually hangs if the timeout on each URLSessionTask is sufficiently high. With a low (or default) timeout value, many URLSessionTasks instead fail with a "request timed out" URLError. The hang happens very frequently on Linux but only intermittently on macOS with swift-corelibs-foundation. No hang or timeout is seen on Darwin with the Objective-C Foundation.

Note: the test case needs GitHub auth tokens to run.
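
For reference, a minimal sketch of this kind of stress test (hypothetical; this is not the attached app, which authenticates against the GitHub API):

```swift
import Dispatch
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Queue a large number of requests against the same endpoint and count completions.
// The URL below is a placeholder; the real test hits authenticated GitHub API endpoints.
let session = URLSession(configuration: .default)
let url = URL(string: "https://api.github.com/zen")!
let total = 5000
let group = DispatchGroup()
let counterQueue = DispatchQueue(label: "counter")
var finished = 0

for i in 1...total {
    group.enter()
    session.dataTask(with: url) { _, _, error in
        counterQueue.sync {
            if let error = error { print("task failed: \(error)") }
            finished += 1
        }
        group.leave()
    }.resume()
    print("Queued \(i) tasks until now")
}

group.wait()  // in the failing case this never returns
print("Finished a total of \(finished) tasks!")
```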

Passing case:
The number of queued tasks is equal to the number of finished tasks. There should be no instances of URLError.

Queued 5048 tasks until now
...
...
Finished a total of 5048 tasks!

Failing case:
The number of finished tasks is less than the number of queued tasks. There may be a hang and/or URLError instances in the output.

Queued 934 tasks until now
....
....
....
Finished a total of 332 tasks!
<hang>
@pushkarnk
Collaborator Author

Problem analysis:

  1. I reproduced the hang on Linux by running the app inside lldb and retrieved the backtraces of all threads:
(lldb) bt all
* thread #1: tid = 5333, 0x00007ffff5bf3826 libc.so.6`sigsuspend + 86, name = 'main', stop reason = signal SIGSTOP
  * frame #0: 0x00007ffff5bf3826 libc.so.6`sigsuspend + 86
    frame #1: 0x00007ffff7f1e608 libdispatch.so`_dispatch_sigsuspend + 24
    frame #2: 0x00007ffff7f1ca56 libdispatch.so`_dispatch_sig_thread + 6
    frame #3: 0x00007ffff6c35439 libpthread.so.0`__nptl_deallocate_tsd.part.4 + 137
    frame #4: 0x00007ffff5bde854 libc.so.6`__libc_start_main + 276
    frame #5: 0x0000000000403659 main`_start + 41

  thread #2: tid = 6379, 0x00007ffff6c3ea15 libpthread.so.0`do_futex_wait + 69, name = 'main'
    frame #0: 0x00007ffff6c3ea15 libpthread.so.0`do_futex_wait + 69
    frame #1: 0x00007ffff6c3eadf libpthread.so.0`__new_sem_wait_slow + 95
    frame #2: 0x00007ffff6c3eb92 libpthread.so.0`sem_timedwait + 66

  thread #3: tid = 6382, 0x00007ffff6c3c709 libpthread.so.0`__pthread_cond_timedwait + 297, name = 'main'
    frame #0: 0x00007ffff6c3c709 libpthread.so.0`__pthread_cond_timedwait + 297
    frame #1: 0x00007ffff7f3aab4 libdispatch.so`overcommit_worker_main(unused=<unavailable>) + 340 at manager.c:326
    frame #2: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #3: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #6: tid = 6387, 0x00007ffff5cc4e23 libc.so.6`epoll_wait + 51, name = 'main'
    frame #0: 0x00007ffff5cc4e23 libc.so.6`epoll_wait + 51
    frame #1: 0x00007ffff7f25955 libdispatch.so`_dispatch_event_loop_drain + 53
    frame #2: 0x00007ffff7f218e8 libdispatch.so`_dispatch_mgr_invoke + 136
    frame #3: 0x00007ffff7f216ab libdispatch.so`_dispatch_mgr_thread + 11
    frame #4: 0x00007ffff7f1d9a4 libdispatch.so`_dispatch_root_queue_drain + 388
    frame #5: 0x00007ffff7f1d798 libdispatch.so`_dispatch_worker_thread + 440
    frame #6: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #7: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #9: tid = 6391, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #13: tid = 6420, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #14: tid = 6434, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #18: tid = 6461, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #19: tid = 6465, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #20: tid = 6470, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #21: tid = 6475, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #22: tid = 6480, 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192, name = 'main'
    frame #0: 0x00007ffff6c3c360 libpthread.so.0`__pthread_cond_wait + 192
    frame #1: 0x00007ffff7f3bb0b libdispatch.so`worker_main [inlined] wqlist_scan_wait + 17 at manager.c:451
    frame #2: 0x00007ffff7f3bafa libdispatch.so`worker_main(unused=<unavailable>) + 442 at manager.c:493
    frame #3: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #4: 0x00007ffff5cc482d libc.so.6`clone + 109

  thread #23: tid = 7985, 0x00007ffff6c3c709 libpthread.so.0`__pthread_cond_timedwait + 297, name = 'main'
    frame #0: 0x00007ffff6c3c709 libpthread.so.0`__pthread_cond_timedwait + 297
    frame #1: 0x00007ffff7f3aab4 libdispatch.so`overcommit_worker_main(unused=<unavailable>) + 340 at manager.c:326
    frame #2: 0x00007ffff6c366ba libpthread.so.0`start_thread + 202
    frame #3: 0x00007ffff5cc482d libc.so.6`clone + 109

It appeared that all of the dispatch worker threads were waiting for events to occur. This could also simply mean a lack of responses from the GitHub endpoint, but that was very unlikely to happen at the observed frequency.

2. I then got in touch with Dave Grove for a libdispatch-level explanation of the frozen state of the application. Dave found that all the dispatch queues were empty, which meant we were waiting forever for an event to occur.

3. Further, working with a libdispatch build from Dave Grove (instrumented with a lot of debug prints) and a debug build of Foundation, I was able to find a failure pattern:

foundation: Registering #14 for unregister
foundation: canceling read source
foundation: Registering #14 for registerRead
libdispatch: called _dispatch_unote_unregister
<hang>

…
foundation: calling the handler we installed, now performing action on the socket
foundation: Registering #14 for unregister
foundation: canceling read source
foundation: Registering #14 for registerRead
libdispatch: called _dispatch_unote_unregister
<hang>

…
foundation: calling the handler we installed, now performing action on the socket
foundation: Registering #14 for unregister
foundation: canceling read source
foundation: Registering #14 for registerRead
libdispatch: called _dispatch_unote_unregister
<hang>

…
foundation: Registering #14 for unregister
foundation: canceling read source
foundation: Registering #14 for registerRead
libdispatch: called _dispatch_unote_unregister
Finished a total of 235 tasks!
<hang>

…
foundation: calling the handler we installed, now performing action on the socket
foundation: Registering #14 for unregister
foundation: canceling read source
foundation: Registering #14 for registerRead
libdispatch: called _dispatch_unote_unregister
Finished a total of 130 tasks!
<hang>

  4. One thing to note in the above pattern is this sequence of events across two different queues:
event 1: we tell libdispatch to unregister fd #14 on queue 1
event 2: libdispatch begins the unregister process on queue 2
event 3: we tell libdispatch to register fd #14 on queue 1
event 4: libdispatch completes the unregister operation on queue 2

Now, because a register operation got interleaved with an unfinished unregister operation on the same file descriptor, the epoll_ctl for the register operation was missed. This led to a situation where we were interested in reading on an fd but had failed to register it with the epoll set. So even though data was available to read on that file descriptor (socket), the event was never delivered. As a result, many URLSession tasks starved and eventually timed out.
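
A rough sketch of that interleaving in terms of the DispatchSource API (illustrative only; the names are made up and this is not the Foundation source):

```swift
import Dispatch

// cancel() only *requests* cancellation; the kernel-level epoll_ctl(EPOLL_CTL_DEL)
// runs later, on a different libdispatch thread.
func unregisterThenRegister(oldSource: DispatchSourceRead,
                            fd: Int32,
                            queue: DispatchQueue) -> DispatchSourceRead {
    // event 1: we ask libdispatch to unregister fd (returns immediately)
    oldSource.cancel()

    // event 3: we register the same fd again, right away
    let newSource = DispatchSource.makeReadSource(fileDescriptor: fd, queue: queue)
    newSource.setEventHandler { /* read from the socket, driving libcurl */ }
    newSource.resume()

    // events 2 and 4 (the asynchronous unregister work) may complete only after
    // event 3; when that happens, the registration for the new source is lost
    // and no read event is ever delivered for fd.
    return newSource
}
```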

@pushkarnk
Collaborator Author

Code understanding:
The URLSession implementation in swift-corelibs-foundation uses libcurl for HTTP support and libdispatch as the eventing library. Foundation, Dispatch and libcurl operate in a particular way, and their interplay can lead to race conditions that cause a hang or URLSessionTasks timing out with no response. I try to explain the problem in words here.

Whenever libcurl opens a connection for Foundation, it notifies the latter, via a callback, about the socket it created and the events Foundation should expect on that socket. For example, after making a POST call, libcurl will tell Foundation, in effect: "I have assigned a socket with fd number X; track this socket for the read event, because data should be available to read on it in the near future."

In response, Foundation registers this socket for the notified event with Dispatch. In the above case, Foundation tells Dispatch: when a read event occurs on fd #X, run this handler. The handler typically does the socket reading, during which libcurl is invoked again. Once the read is over and libcurl thinks there is nothing more to be done on the socket with fd #X, it tells Foundation to unregister fd #X from libdispatch.
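
Roughly, the register/unregister flow looks like this (a sketch with assumed names, not the actual swift-corelibs-foundation code):

```swift
import Dispatch

final class SocketReadSource {
    private var readSource: DispatchSourceRead?

    // libcurl's socket callback asked us to watch fd for reads.
    func register(fd: Int32, queue: DispatchQueue, onReadable: @escaping () -> Void) {
        let source = DispatchSource.makeReadSource(fileDescriptor: fd, queue: queue)
        source.setEventHandler(handler: onReadable)  // eventually performs the socket read via libcurl
        source.resume()
        readSource = source
    }

    // libcurl's socket callback said it is done with fd.
    func unregister() {
        readSource?.cancel()   // the cancellation itself completes asynchronously
        readSource = nil
    }
}
```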

Foundation implements the registering and unregistering of fds with libdispatch using the DispatchSource API. On registering for a read, we create a DispatchSourceRead and provide an event handler (which eventually does the socket read). For unregistering, we cancel the DispatchSource. However, the DispatchSource API clearly states that a cancel() operation is completed only when the cancellation handler runs. See: https://developer.apple.com/reference/dispatch/dispatchsourceprotocol/1780783-cancel
Note: we do not currently install a cancel handler.
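
In other words, observing the completion of a cancel requires something along these lines (a minimal sketch):

```swift
import Dispatch

func cancelWithHandler(fd: Int32, queue: DispatchQueue) {
    let source = DispatchSource.makeReadSource(fileDescriptor: fd, queue: queue)
    source.setEventHandler { /* read from fd */ }
    source.setCancelHandler {
        // Only at this point is the source fully unregistered; only now is it
        // safe to close fd or to create a new source for the same descriptor.
    }
    source.resume()
    source.cancel()   // returns right away; the cancel handler runs later
}
```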

Now, libcurl always tries to reuse connections. So, if one is making a series of API calls to the same endpoint, it is very likely that the same TCP-level connection will be reused. There is a provision to open a new connection on every transfer, but that obviously leads to very poor performance.

Given this background, when an app is making close to 5000 API calls, back to back, to the same endpoint, we are going to be reusing the same socket (fd) again and again. It is then possible for Foundation to get an unregister callback immediately followed by a register callback. The unregister callback leads to DispatchSource.cancel(), which returns immediately, but the actual epoll_ctl with EPOLL_CTL_DEL happens asynchronously, on a different worker thread. If, after the DispatchSource.cancel() but before the EPOLL_CTL_DEL, libcurl asks us to register the same file descriptor, the EPOLL_CTL_ADD or EPOLL_CTL_MOD goes missing.

Hence, in the interplay of Foundation, libcurl and an epoll-based Dispatch, a unique race condition can lead to a failure to register sockets, effectively leading to a hang or a timeout.

Proposed solution:

We've tried half a dozen fixes for this race, but many of them introduced other race conditions. The best solution we're left with is to synchronize the register and unregister operations: an unregister operation is flagged as "completed" only when the related cancel handler runs. Also, the register/unregister code needs to be moved to a queue different from the MultiHandle workqueue to avoid a certain deadlock condition.
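
A sketch of what that synchronization could look like (an assumed shape for illustration, not the actual patch):

```swift
import Dispatch

// The unregister is treated as complete only once the cancel handler has run,
// and a subsequent register for the same fd can only happen after that. The
// sources live on a queue separate from the MultiHandle workqueue; unregister()
// must not be called from that source queue, or the wait below would deadlock.
final class SynchronizedSocketSources {
    private let sourceQueue = DispatchQueue(label: "socket-sources")
    private var readSource: DispatchSourceRead?

    func unregister() {
        guard let source = readSource else { return }
        let done = DispatchSemaphore(value: 0)
        source.setCancelHandler {
            done.signal()          // the unregister is "completed" only here
        }
        source.cancel()
        done.wait()                // block until the deregistration has finished
        readSource = nil
    }

    func registerRead(fd: Int32, onReadable: @escaping () -> Void) {
        // Any earlier unregister of this fd has fully completed by now, so the
        // new registration can no longer be swallowed by a pending EPOLL_CTL_DEL.
        let source = DispatchSource.makeReadSource(fileDescriptor: fd, queue: sourceQueue)
        source.setEventHandler(handler: onReadable)
        source.resume()
        readSource = source
    }
}
```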

@pushkarnk
Collaborator Author

#949

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@shahmishal shahmishal transferred this issue from apple/swift May 5, 2022
This issue was closed.