Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-11557] URLSession closing unrelated file descriptors #3286

Closed
weissi opened this issue Oct 1, 2019 · 12 comments
Closed

[SR-11557] URLSession closing unrelated file descriptors #3286

weissi opened this issue Oct 1, 2019 · 12 comments

Comments

@weissi
Copy link
Member

weissi commented Oct 1, 2019

Previous ID SR-11557
Radar rdar://55906172
Original Reporter @weissi
Type Bug
Status Resolved
Resolution Done

Attachment: Download

Additional Detail from JIRA
Votes 3
Component/s Foundation
Labels Bug, 5.1Regression
Assignee @millenomi
Priority Medium

md5: 2a28abe2dfef5c54655ad2a7c6ac999c

Issue Description:

This is a very serious issue where in Swift 5.1, corelibs-foundation's URLSession is closing unrelated file descriptors, leading to arbitrary undefined behaviour. Swift 5.0.x do not seem to have this issue.

The attached (single file but depending on SwiftNIO to have a web server) SwiftPM project uses URLSession to continuously make requests to the web server that the projects itself spawns too.

A successful run should look like

=== ROUND 1 ===
........................................................................................................................................................................................................done
=== ROUND 2 ===
........................................................................................................................................................................................................done
=== ROUND 3 ===
........................................................................................................................................................................................................done
=== ROUND 4 ===
........................................................................................................................................................................................................done
=== ROUND 5 ===
........................................................................................................................................................................................................done
=== ROUND 6 ===
........................................................................................................................................................................................................done
[ continues forever ]

which it does on macOS (as well as on 5.0.3 on Linux).

On Linux however, the program usually gets stuck but arbitrary bad things can happen. The problem is best seen when started as

swift build && strace -ff .build/debug/URLSessionRandoCloseRepro 2>&1 | grep --line-buffered -B5 EBADF

which greps for syscalls returning EBADF (which should never happen).

Running it however leads to:

[pid  4474] <... dup resumed> )         = 59
[pid  4482] <... clock_gettime resumed> {tv_sec=48178, tv_nsec=154579697}) = 0
[pid  4474] close(7 <unfinished ...>
[pid  4482] sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] <... close resumed> )       = 0
[pid  4482] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)
--
[pid  4474] <... dup resumed> )         = 10
[pid  4481] <... clock_gettime resumed> {tv_sec=48182, tv_nsec=195487097}) = 0
[pid  4474] dup(1 <unfinished ...>
[pid  4481] sendto(44, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] <... dup resumed> )         = 12
[pid  4481] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)
--
[pid  4474] close(50 <unfinished ...>
[pid  4476] <... clock_gettime resumed> {tv_sec=48188, tv_nsec=404958097}) = 0
[pid  4474] <... close resumed> )       = 0
[pid  4476] sendto(23, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] close(34)                   = 0
[pid  4476] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)
--
[pid  4485] futex(0x7fcc61109260, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1569966418, tv_nsec=881007200}, 0xffffffff <unfinished ...>
[pid  4474] close(37 <unfinished ...>
[pid  4479] fstat(9,  <unfinished ...>
[pid  4474] <... close resumed> )       = 0
[pid  4475] <... timerfd_settime resumed> NULL) = 0
[pid  4479] <... fstat resumed> 0x7fcc54974ae8) = -1 EBADF (Bad file descriptor)
--
[pid  4491] <... openat resumed> )      = 7
[pid  4474] <... close resumed> )       = 0
[pid  4491] fstat(7,  <unfinished ...>
[pid  4474] close(9 <unfinished ...>
[pid  4491] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid  4474] <... close resumed> )       = -1 EBADF (Bad file descriptor)
--
[pid  4474] close(12 <unfinished ...>
[pid  4482] <... clock_gettime resumed> {tv_sec=48267, tv_nsec=918388397}) = 0
[pid  4474] <... close resumed> )       = 0
[pid  4482] sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] close(14 <unfinished ...>
[pid  4482] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)
--
[pid  4489] <... clock_gettime resumed> {tv_sec=48268, tv_nsec=26907697}) = 0
[pid  4474] <... close resumed> )       = 0
[pid  4474] close(7 <unfinished ...>
[pid  4489] sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] <... close resumed> )       = 0
[pid  4489] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)
--
[pid  4474] <... close resumed> )       = 0
[pid  4489] <... clock_gettime resumed> {tv_sec=48268, tv_nsec=433895397}) = 0
[pid  4474] close(7 <unfinished ...>
[pid  4489] sendto(7, "GET / HTTP/1.1\r\nHost: localhost:"..., 218, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid  4474] <... close resumed> )       = 0
[pid  4489] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)

which shows severe file descriptor mismanagement in many cases.

One line repro from macOS:

tar xf URLSessionRandoCloseRepro.tar.gz
cd URLSessionRandoCloseRepro
docker run -it --rm --privileged -v "$PWD:$PWD" -w "$PWD" swift:5.1 bash -c 'apt-get -yy update && apt-get -yy install strace && swift build && strace -ff .build/debug/URLSessionRandoCloseRepro 2>&1 | grep --line-buffered -B5 EBADF'

Many thanks to jordane (JIRA User) who first reported the issue on the Vapor Discord.

@weissi
Copy link
Member Author

weissi commented Oct 1, 2019

CC @millenomi/@spevans do we know what changed in URLSession from 5.0.x to 5.1?

@weissi
Copy link
Member Author

weissi commented Oct 1, 2019

@swift-ci create

@weissi
Copy link
Member Author

weissi commented Oct 1, 2019

also CC tannernelson (JIRA User)/ianpartridge (JIRA User)/@pushkarnk/@Lukasa/ldewailly (JIRA User): We all must stop using URLSession on Linux on Swift 5.1 or else applications might randomly get stuck, crash, or whatever really because arbitrary file descriptors disappear.

For SwiftNIO users, in debug builds, SwiftNIO will always crash if it detects EBADF because there's really no sane situation where EBADF would ever be expected. Because URLSession seems to close arbitrary file descriptors, SwiftNIO users might see this problem as

Assertion failed: blacklisted errno 9 Bad file descriptor in writev(descriptor:iovecs:)): file /app/.build/checkouts/swift-nio.git--7833811624640057844/Sources/NIO/System.swift, line 108

or similar.

@gwynne
Copy link
Contributor

gwynne commented Oct 1, 2019

Probably caused by 02656bf

libcurl ignores the socket close callback in certain uses due to the fall-through in the "secondary" case in https://github.com/curl/curl/blob/5050edb12484551f4c666212409c117cf38f9784/lib/connect.c#L1414-L1446, but I'm not clear on whether that would matter for CFURLSession's usage. I tend to doubt it, really.

If anything, it looks like the sockets are being closed too quickly, which is ironic considering the commit was intended to avoid exactly that issue. The commit should probably be amended to ensure the curl handle has been fully torn down first if that is not already happening (I didn't look closely enough to be sure, so apologies if that's a red herring).

@weissi
Copy link
Member Author

weissi commented Oct 2, 2019

@swift-ci create

1 similar comment
@kevints
Copy link
Mannequin

kevints mannequin commented Oct 2, 2019

@swift-ci create

@drodriguez
Copy link
Collaborator

When I read this issue in the forums I was afraid it was going to be those changes in Foundation/cURL.

We still don't understand why the double closing is happening, but Gwynne clues might be true.

I will try to explain the situation before the patch was applied.

  • The socket is created by cURL, and provided to URLSession._MultiHandle through register(socket:, for:, what:, socketSourcePtr🙂.

  • In there, a _SocketSource is created, and one or two DispatchSource are created, associated with the socket file descriptor.

  • The rules of the DispatchSource is that you should only close the socket after the sources have been cancelled, in the cancelled callback of the source.

  • However, before the patch, the socket was closed by cURL, and the sources were never cancelled (there were no call to tearDown even if the code was there). The sources were simply released and forgotten. It might happen, however, that the polling happening inside the dispatch sources will be blocked forever if it has no timeout, since epoll will not receive an event if the process itself is closing the file descriptor (my knowledge about dispatch is not the best, so that might not be a factual description).

  • In Windows, where we find the problem, since it doesn't use epoll, but cpio, the polling will fail when the handle is closed, crashing the program.

What the changes tries to avoid was that inversion of the order, and doing the things in the right order: tell cURL that we will close the sockets, and tear down the dispatch sources when the socket was not longer used by cURL.

However, as Gwynne points out, that special branch in cURL about secondary sockets would have never been triggered in the previous patch, and it is being triggered now. Or maybe calling tearDown is the thing that is causing the problem. Maybe tearDown is being called twice and then invoking the notify block twice, which will close the file descriptor twice. I will try to figure out what that branch from cURL is doing, and if unregister can happen twice, and try to guard about it.

I would like to say, however, that the patch was trying to solve a problem that might have left a lot of epoll resources blocked forever in the case our analysis was right.

@millenomi
Copy link
Contributor

Revert for now.

@millenomi
Copy link
Contributor

#2529

@drodriguez
Copy link
Collaborator

@weissi: sorry, I forgot to ask before. I don't know that much about Swift-NIO, but why the changes in Foundation will affect the usage of writev in Swift-NIO? Or you mean that indirectly, a process that uses both Swift-NIO and Foundation independently will see those failures?

@weissi
Copy link
Member Author

weissi commented Oct 3, 2019

@drodriguez SwiftNIO has a number of blacklisted errnos that must never happen and if they do, SwiftNIO will (in debug mode) crash the program. EBADF is one of our blacklisted assertions. This bug leads to double closes which might close random file descriptors owned by SwiftNIO.

So if you use this broken version of URLSession and SwiftNIO in the same process, it’s highly likely that sooner or later your program will crash with the above error.

I added this to the description here in case someone googles for it, we’ve had multiple reports of this crash already, all due to the bug described here. In fact, that’s how we first found it 🙂

@drodriguez
Copy link
Collaborator

Thanks for the explanation @weissi!

I have been looking into this a little bit more. I split the repro into a Server and a Client to discard possible interaction, and it seems to be completely driven by FoundationNetworking.

From what I can find out in the cases I have captured, it seems that the socket closing works correctly. Example of strace with my comments on the side. We are following socket 13.

// I think this is cURL checking if there's something to read in the socket (https://github.com/curl/curl/blob/curl-7_58_0/lib/select.c#L217)
[pid   318] poll([{fd=13, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0 <unfinished ...>
...
// This is cURL reading from the socket (probably https://github.com/curl/curl/blob/curl-7_58_0/lib/sendf.c#L668, but not sure)
[pid   318] recvfrom(13,  <unfinished ...>
...
[pid   318] <... recvfrom resumed> "HTTP/1.1 200 OK\r\nContent-Type: 8"..., 16384, 0, NULL, NULL) = 9052
...
// I think this libdispatch, removing the fd from the epoll used in Linux (there are several calls to epoll_ctl with EPOLL_CTL_DEL in https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/event/event_epoll.c)
[pid   320] epoll_ctl(31, EPOLL_CTL_DEL, 13, NULL <unfinished ...>
...
[pid   320] <... epoll_ctl resumed> )   = 0
...
// Since above it seemed to be removed from dispatch, I think this is the cancel callback from the event source in FoundationNetworking (https://github.com/apple/swift-corelibs-foundation/blob/feebc938e69c8f2ce36ef297ea6c95e03c4d08c4/Foundation/URLSession/libcurl/MultiHandle.swift#L455)
[pid   318] close(13 <unfinished ...>
...
// One of the dup(1) is started from the main thread
[pid   315] dup(1 <unfinished ...>
...
[pid   318] <... close resumed> )       = 0
// And dup receives the recently closed 13
[pid   315] <... dup resumed> )         = 13
...
// And then cURL seems to be polling the socket again
[pid   322] poll([{fd=13, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0 <unfinished ...>
...
[pid   322] <... poll resumed> )        = 0 (Timeout)
// And then the close from the main thread happens (it can also be seen that 39 is closed before, and 40 is closed after, which was the same order that the dups above were returning descriptors)
[pid   315] close(13 <unfinished ...>
...
// Finally cURL tries to create a request in that socket that was closed
[pid   322] sendto(13, "GET / HTTP/1.1\r\nHost: localhost:"..., 199, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
...
// which obviosuly fails
[pid   322] <... sendto resumed> )      = -1 EBADF (Bad file descriptor)

My problem is now finding why that second poll happens, or why cURL thinks that it can use that file descriptor for sendto. I am pretty sure that the dispatch sources has been cancelled, and so they cannot execute the event handler, but that might be one way that cURL thinks that socket is valid, but it also means that their internal structures are keeping that socket fd around.

I want to see if I have some time to check without the patch, and see if the problem reproduces or not. I really don't see why letting cURL close the socket is that different than delaying closing the socket after the cancellation of the dispatch sources (which is the correct thing to do according to the docs, and seems the correct thing to do if you don't want crashes in some OS).

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@shahmishal shahmishal transferred this issue from apple/swift May 5, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants