Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CP-32622: avoid using select and instead use epoll #4877

Closed

Conversation

snwoods
Copy link
Contributor

@snwoods snwoods commented Dec 16, 2022

To avoid xapi from running out of file descriptors, replace select with epoll in most cases

let epoll = Polly.create () in
List.iter (fun fd -> Polly.add epoll fd Polly.Events.inp) (r @ w) ;
ignore
@@ Polly.wait epoll 4 (-1) (fun _ fd _ ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the old code can be summarised as: "Wait for any r to be ready to reading or any of the w to be ready for writing. Then read/write to the fd found ready. Start again.". In every round of the loop (every time select returns) one or more fds would be ready, and it may often be just one. So we can't assume that we can write + read in the same round.

The third argument of the function passed in here, currently ignore with _, is set to the kind of event that happened and indicates whether we can read from or write to fd.

inner remain_time max_bytes
else
inner remain_time max_bytes
Unix.setsockopt_float fd Unix.SO_RCVTIMEO (Int64.to_float max_time) ;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is giving a parse error on make test

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 3 times, most recently from 4a79673 to a562d49 Compare January 10, 2023 15:57
if !i < !buf_remote_end + b then final := true ;
buf_remote_end := !i
) ;
if fd = file_desc then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative style could use match:

match file_desc with
| file_desc when file_desc = Unix.stdin -> ..
| file_desc when file_desc = fd -> ..
| _otherwise ->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to handle one or all FDs that are ready? Up to 2 can be ready. This code handles two if they are ready but we have other places where we handle only one. It's not obvious why we have different behavior.

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 2 times, most recently from 823ef93 to ea9cb03 Compare January 13, 2023 14:31
[comms_sock; fd_sock] ;
(* Although there are two fds, we set max_fds to 1 here as we only want this
function to trigger once so that we get one return value *)
Polly.wait_fold epoll 1 (-1) state (fun _ fd _ _ ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1 correct here though, would that prevent it from even watching the other FD?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that one FD would be processed per round - and any other FD that is ready would be reported again in the next round. But does this guarantee fairness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the only way I could think to make it work, do you have any other alternatives?

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 3 times, most recently from 535388b to 596a6dc Compare January 23, 2023 12:31
@snwoods snwoods marked this pull request as ready for review January 23, 2023 12:31
@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch from 596a6dc to 33c4bb6 Compare January 23, 2023 12:34
@psafont psafont changed the title Private/stevenwo/cp 32622 CP-32622: avoid using select and instead use epoll Jan 23, 2023
@lindig
Copy link
Contributor

lindig commented Jan 24, 2023

Remember that the timeout in Polly.wait is specified in milliseconds. Maybe I should change the Polly interface to use a labeled argument timeout_ms.

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch from 33c4bb6 to d4d6c10 Compare January 24, 2023 10:07
~finally:(fun () -> Polly.close epoll)
(fun () ->
ignore
@@ Polly.wait epoll 4 (-1) (fun _ fd _ ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the code below matches exactly one condition and not all fds that are ready; could this be written as

if fd = x then do something;
if fd = y then do something;
..

Maybe leave a comment why the selected behavior is the right one.

if !i < !buf_remote_end + b then final := true ;
buf_remote_end := !i
) ;
if fd = file_desc then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to handle one or all FDs that are ready? Up to 2 can be ready. This code handles two if they are ready but we have other places where we handle only one. It's not obvious why we have different behavior.

(fun () ->
ignore
@@ Polly.wait epoll 4 (-1) (fun _ fd event ->
if event = Polly.Events.inp then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this handles at most one FD. I think the unconditional else case is a bit dangerous and I would explicitly check than any FD is ready before acting on it. This applies to other places as well.

match Unix.read fd bytes 0 4096 with
| 0 ->
Buffer.contents buf (* EOF *)
| n ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be expressed as

| n when n >= max_bytes -> ..
| n (* otherwise *) ->
| exception .. ->

List.iter (fun fd -> Polly.add epoll fd Polly.Events.inp) r ;
List.iter (fun fd -> Polly.add epoll fd Polly.Events.out) w ;
Fun.protect
~finally:(fun () -> Polly.close epoll)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use this pattern quite a lot, should we have a with_polly somewhere?

(fun () ->
Unix.setsockopt_float s Unix.SO_RCVTIMEO timeout ;
try fst (Unix.accept s)
with Unix.Unix_error (Unix.EAGAIN, _, _) -> raise Unixext.Timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does SO_RCVTIMEO work with accept? I can't find a clear statement for or against in the manpage of 'socket(7)', it mentions read/recvmsg/send/sendmsg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're right, the only examples online of using a timeout with accept are using select

inner remain_time max_bytes
else
inner remain_time max_bytes
( try Unix.setsockopt_float fd SO_RCVTIMEO (Int64.to_float max_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do seem to repeat this quite often, perhaps as a future cleanup PR we could define a helper function: 'timed_read' which does the setsockopt+read+fun.protect to set it back, same for write a 'timed_write' might be useful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would avoid mistakes, such as around resetting the socket option.

to_close := fd :: !to_close ;
proxy fd proxy_socket
) else
assert false (* can never happen *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be so sure, we could get spurious notifications from 'epoll', and we don't want to crash the application (or thread). Perhaps just log it as error for now?
Especially that you've added 2 file descriptors s_ip and s_unix, this will NOT be just s_unix, it is entirely possible that s_ip will have something available, otherwise why add it to epoll in the first place?

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 5 times, most recently from 8f13617 to 89bb393 Compare February 6, 2023 09:36
@snwoods
Copy link
Contributor Author

snwoods commented Feb 6, 2023

I believe all of the comments have now been addressed.

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 2 times, most recently from d4b4878 to e124de2 Compare February 6, 2023 12:39
Comment on lines 341 to 344
if fds = [] then (* We must have timed out *)
raise Unixext.Timeout
else (* There will only ever be a maximum of one fd *)
List.hd fds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While List.hd is guarded here, it's preferrable to destructure the list to obtain values from it:

Suggested change
if fds = [] then (* We must have timed out *)
raise Unixext.Timeout
else (* There will only ever be a maximum of one fd *)
List.hd fds
match fds with
| [] -> (* We must have timed out *)
raise Unixext.Timeout
| fd :: _ -> (* There will only ever be a maximum of one fd *)
fd

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch from e124de2 to da3f2d9 Compare February 6, 2023 13:07
@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch 2 times, most recently from eed05d5 to f259cae Compare February 8, 2023 11:30
@snwoods
Copy link
Contributor Author

snwoods commented Feb 9, 2023

All comments addressed and suiterun 179915 passing.

@@ -394,7 +394,7 @@ let rec retry f = function
try f ()
with Stunnel_initialisation_failed ->
(* Leave a few seconds between each attempt *)
ignore (Unix.select [] [] [] 3.) ;
Unix.sleep 3 ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Thread.delay

@snwoods snwoods force-pushed the private/stevenwo/CP-32622 branch from f259cae to f4c2087 Compare February 10, 2023 10:48
n
with Unix.Unix_error (Unix.EAGAIN, _, _) -> -1
in
Unix.setsockopt_float ic.fd Unix.SO_RCVTIMEO 0. ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd better use finally to ensure that this timeout is reset in case of any exception (only one kind is caught above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every read on ic.fd seems to go through here, and each read is preceded by setting a timeout, so I don't think we even need to reset the timeout, just drop this line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are more reads in http.ml but they also set a socket timeout. Will need to check the other callers too.

r = []
try
ignore (Unix.read sock_out (Bytes.create 1) 0 1) ;
Unix.setsockopt_float sock_out Unix.SO_RCVTIMEO 0. ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we'd miss resetting the socket option in case of an exception.

@edwintorok
Copy link
Contributor

Testing whether 'select' is completely gone or not is tricky due to a lot of dead code in the linked binary.
I hacked up a script to find at least the actual OCaml functions that call 'Unix.select' directly:

#!/bin/sh
set -eu

EXE="$1"
ADDR="${EXE}.unix_select.address"
DISASM="${EXE}.disasm"

# Find address of 'unix_select' symbol
addr2line -pae "${EXE}" unix_select \
  | sed -e "s/^0x0*\\([0-9a-fA-F]*\\):.*/\\1/" \
  | tr -d '\n' >"${ADDR}"

# Find all calls to 'unix_select', which is done by
#  mov $0x...,%rax
#  call <caml_c_call>

# disassemble, but without sources
# (source lookup doesn't work for all dependencies, and is very slow on a large binary)
# To make debugging easier the disassembly is saved to a file instead of piping
objdump -d --no-show-raw-insn --no-addresses "${EXE}" >"${DISASM}"

SYMADDR=$(cat "${ADDR}")
# Prints a list of functions that call unix_select
# But we won't know whether it is dead code or not
gawk -f unix_select.gawk "${SYMADDR}" <"${DISASM}"

unix_select.gawk:

BEGIN { SYMADDR=ARGV[1]; ARGV[1]=""; }

/^<.*/ { SYMBOL=$1 }
match($0, /mov.*0x([0-9a-fA-F]*),%rax/, addr) { RAX=addr[1]; }
/call.*<caml_c_call>/ { if (RAX == SYMADDR) { print SYMBOL }  }

It shows these for XAPI and xenopsd:

./no-unix-select.sh _build/default/ocaml/xapi/xapi_main.exe
<camlXapi_stdext_threads__Threadext__fun_812>:
<camlXapi_stdext_unix__Unixext__kill_and_wait_inner_2120>:
<camlXapi_stdext_unix__Unixext__proxy_1371>:
<camlXapi_stdext_unix__Unixext__time_limited_write_internal_1424>:
<camlXapi_stdext_unix__Unixext__time_limited_read_1450>:
<camlThread__wait_timed_read_777>:
<camlThread__wait_timed_write_781>:
<camlUnix__fun_2441>:

And for xenopsd:

./no-unix-select.sh _build/default/ocaml/xenopsd/xc/xenops_xc_main.exe
<camlXapi_stdext_unix__Unixext__kill_and_wait_inner_2120>:
<camlXapi_stdext_unix__Unixext__proxy_1371>:
<camlXapi_stdext_unix__Unixext__time_limited_write_internal_1424>:
<camlXapi_stdext_unix__Unixext__time_limited_read_1450>:
<camlXapi_stdext_threads__Threadext__fun_812>:
<camlWatch__fun_884>:
<camlThread__wait_timed_read_777>:
<camlThread__wait_timed_write_781>:
<camlUnix__fun_2441>:

@edwintorok
Copy link
Contributor

The 'select' in watch.ml can be replaced similarly as elsewhere by changing pipe->socket and using socket timeouts.

@edwintorok
Copy link
Contributor

Removed some Unix.select and Thread.wait* usages in xapi-project/stdext#80.

And a few more from XAPI and xenopsd as well: https://github.com/edwintorok/xen-api/commits/private/edvint/CP-32622

We'll probably need a CA ticket and backport the watch.ml fix since it affects xenopsd which already claims to support >1024 fds.

I think these now remove all the obvious Unix.select usages, but we should implement what has been suggested here and add some CI checks in xs-opam with various methods to check that we really don't call it anymore (and then add some unit tests to all these functions that we modified to check that they still work correctly):
https://discuss.ocaml.org/t/ensuring-that-an-ocaml-binary-never-calls-unix-select/13465/7?u=edwin

@edwintorok
Copy link
Contributor

I've updated the no-unix-select.sh to be recursive up to a certain depth, and it didn't find more calls.
Going to try the .cmt iterator suggested in the forums just to have another way to validate this.

@snwoods
Copy link
Contributor Author

snwoods commented Mar 11, 2024

I'm closing this and have opened a PR against Edwin's fork so that he can still keep track of it edwintorok#4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants