Reconnect peerd if connection is dropped during running swap #426
Conversation
a479042 to 8e91986
Codecov Report
@@            Coverage Diff             @@
##              main     #426     +/-   ##
==========================================
- Coverage     13.0%    12.7%    -0.4%
==========================================
  Files           27       27
  Lines         7915     8089     +174
==========================================
- Hits          1030     1024       -6
- Misses        6885     7065     +180
Continue to review full report at Codecov.
I can't reproduce the test failure on my local machine. Seems like the heuristic I am using here to select the correct peerd does not work on the CI.
@h4sh3d I implemented the approach we discussed on 11.05.2022 and it seems to be working quite nicely. I had to add some more hacky things though to get reliable information on when the spawned thread is terminated. Again the tests completed fine locally. Please also drop some opinions on a reconnect back-off period. Maybe it's even for the best if we just reconnect every …
src/peerd/runtime.rs (Outdated)

@@ -301,7 +315,10 @@ impl Runtime {
            &request.get_type()
        );
        self.messages_sent += 1;
-       self.sender.send_message(message)?;
+       if self.peer_sender.send_message(message.clone()).is_err() {
use while(...)
Done in 34574ec
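For context, a minimal sketch of what the while(...) suggestion amounts to: keep retrying the send and reconnect between attempts, instead of checking the error only once. The PeerSender and reconnect below are simplified stand-ins, not the actual farcaster types.

```rust
// Simplified stand-ins for illustration; not the real peerd runtime types.
struct PeerSender {
    healthy: bool,
}

impl PeerSender {
    fn send_message(&mut self, msg: &str) -> Result<(), String> {
        if self.healthy {
            println!("sent: {}", msg);
            Ok(())
        } else {
            Err("connection dropped".to_string())
        }
    }
}

// Stand-in for re-establishing the peer connection.
fn reconnect(sender: &mut PeerSender) {
    sender.healthy = true;
}

fn send_with_retry(sender: &mut PeerSender, msg: &str) {
    // `while` instead of a single `if`: retry until the send goes through.
    while sender.send_message(msg).is_err() {
        reconnect(sender);
    }
}

fn main() {
    let mut sender = PeerSender { healthy: false };
    send_with_retry(&mut sender, "ping");
}
```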
src/peerd/runtime.rs (Outdated)

        senders: &mut esb::SenderList<ServiceBus, ServiceId>,
    ) -> Result<(), Error> {
        // If this is the listener-forked peerd, terminate it.
        if !self.connect {
move this outside the function.
Done in 0e97200
eb83221 to fc8b3af
Taker log with manual internet cut after Monero funding:
Corresponding maker log:
9447425 to 7042a64
Alice Taker log when killing her internet connection after Bob Maker funds:
Corresponding Bob Maker log:
Same persistent maker as above (hence reconnect), Bob Taker log when killing after funding:
Corresponding Alice Maker log:
ACK
src/peerd/runtime.rs (Outdated)

    fn restart_receiver_runtime(
        internal_identity: ServiceId,
        peer_receiver: PeerReceiver,
        thread_flag_tx: std::sync::mpsc::Sender<()>,
dying_thread_flag_tx maybe?
Done in 0e97200
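A minimal sketch of the dying-thread flag idea discussed in this thread, assuming a plain std::sync::mpsc channel: the spawned receiver thread owns the Sender half and signals (or simply drops it) when it terminates, so the parent can reliably tell that the thread is gone. This is illustrative only, not the actual peerd runtime code.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (dying_thread_flag_tx, dying_thread_flag_rx) = mpsc::channel::<()>();

    thread::spawn(move || {
        // ... the receiver loop would run here ...
        thread::sleep(Duration::from_millis(100));
        // Signal that this thread is about to die. Even without an explicit
        // send, dropping `dying_thread_flag_tx` makes recv() return Err.
        let _ = dying_thread_flag_tx.send(());
    });

    // Blocks until the spawned thread signals or drops its sender.
    match dying_thread_flag_rx.recv() {
        Ok(()) => println!("receiver thread reported it is terminating"),
        Err(_) => println!("receiver thread exited without signalling"),
    }
}
```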
src/peerd/runtime.rs (Outdated)

        // The PeerReceiverRuntime failed, attempt to reconnect with the counterparty
        // It is safe to unwrap remote_node_addr here, since it is Some(..) if connect=true
        let mut connection =
            PeerConnection::connect(self.remote_node_addr.clone().unwrap(), &self.local_node);
Use expect instead of unwrap.
Done in 0e97200
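For illustration, the difference the review asks for: expect documents the invariant ("remote_node_addr is Some whenever connect is true") in the panic message, while unwrap panics without explanation. The values below are simplified stand-ins for the runtime fields.

```rust
fn main() {
    // Stand-ins for the runtime fields; in the real code these are richer types.
    let connect = true;
    let remote_node_addr: Option<String> = Some("203.0.113.1:9735".to_string());

    if connect {
        // expect() names the invariant that justifies the unwrap.
        let addr = remote_node_addr
            .clone()
            .expect("remote_node_addr must be Some(..) when connect=true");
        println!("reconnecting to {}", addr);
    }
}
```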
This laid the groundwork for all the peer connection refactoring; as you mentioned previously, the next steps should be in separate PRs. IMO it's not important if tests don't pass for now.
Not going to work on this further until the deps bump #404 is in.
2aaa124 to a1db046
Otherwise a borrow is held over another borrow
a1db046 to f99900b
fd664ea to 7bf7081
I think I found the correct peerd bridge socket types with a PUSH (connect) / PULL (bind) socket combination, so I'll ask for another round of review now in the hopes that this actually worked out.
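To make the socket-type change concrete, here is a minimal sketch of a PUSH (connect) / PULL (bind) pair using the zmq crate; the inproc://peerd-bridge endpoint and the single-process setup are illustrative assumptions, not the actual farcaster bridge wiring.

```rust
fn main() -> Result<(), zmq::Error> {
    let ctx = zmq::Context::new();

    // The receiving side of the bridge binds a PULL socket...
    let pull = ctx.socket(zmq::PULL)?;
    pull.bind("inproc://peerd-bridge")?;

    // ...and the sending side connects a PUSH socket to it. Unlike PAIR,
    // PUSH/PULL tolerates the peer going away and reconnecting.
    let push = ctx.socket(zmq::PUSH)?;
    push.connect("inproc://peerd-bridge")?;

    push.send("hello over the bridge", 0)?;
    let msg = pull.recv_string(0)?.expect("valid UTF-8");
    println!("bridge received: {}", msg);
    Ok(())
}
```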
Here are the logs from 20 parallel swaps yesterday where I disconnected the takers' network connection just after Bob's funding, or Bob's funding attempt (auto-funding failed for the Bob takers, so I had to fund them manually). 4 swapped successfully; the other 16 refunded. Of the successful swaps, the taker was Alice thrice and Bob once, so no clear relation there.
I've been combing over these logs for some hours now, and in the failed swaps it seems like they don't progress beyond AccLock, meaning that Alice is either not funding, or the syncers are not detecting the funding. I found no evidence that messages sent through peerd are not making it through, or that peerd is somehow connected to the problem. In the logs of the refunded swaps, I see that Alice never reaches the state where she has to send BuyProcSig to Bob. I currently feel like this problem is unrelated to the changes in this PR. Maybe a service used by the Monero syncer is not handling the reconnect properly? I saw that all the swaps are now using the Monero LWS; could it be connected to it?
So my understanding is that the PAIR socket was not appropriate because one does not have control over the peer connection and it might drop, and the PAIR socket has undefined behavior in case the connection drops. Is that correct?
I missed your comment yesterday, apologies. Thanks for taking the time to dig through: your finding also matches what I'm seeing in these logs #462 so far. I'll try again without the LWS backend.
The best PR that I've seen by @TheCharlatan by far, thank you.
great work
This is more defensive and might catch future edge cases.
Yes, you got it, but for total clarity I'll explain again in my own words. The … Since the communication through the bridge is unidirectional from the …
all comments addressed, thank you
This pull request adds code to let the taker, or connecting peerd, reconnect to the listener if the connection is dropped, for example by a connection outage. To make the internal ServiceId address of the spawned peerds deterministic, I added a single message exchange at the beginning of the connection where the taker/connector informs the maker/listener of its node id, allowing us to re-identify connections.
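A hypothetical sketch of that identification step, just to illustrate the idea: the connector's first message carries its node id, and the listener derives a deterministic internal address from it. NodeId, Msg::Identity, and service_id_for are made-up names, not the actual farcaster types.

```rust
#[derive(Debug, Clone)]
struct NodeId(String);

#[derive(Debug)]
enum Msg {
    Identity(NodeId),
    // ... regular protocol messages would follow ...
}

// Deterministic internal address derived from the counterparty's node id, so
// a re-established connection maps back to the same service address.
fn service_id_for(remote: &NodeId) -> String {
    format!("peerd<{}>", remote.0)
}

fn main() {
    // Connector side: the first message on the wire identifies the node.
    let taker_node_id = NodeId("02f1...".to_string());
    let first_msg = Msg::Identity(taker_node_id);

    // Listener side: on receiving Identity, it can re-identify the connection.
    if let Msg::Identity(remote) = first_msg {
        println!("forked peerd gets address {}", service_id_for(&remote));
    }
}
```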
There are multiple ways by which a disconnect is currently detected. Foremost, a disconnect is immediately detected and reported by peerd if the counterparty suddenly hangs up. If there is a connection issue, such as an internet outage, peerd notices it once it attempts to send a ping to the counterparty; if the ping fails, this is reported as well. Additionally, if swapd encounters an error while sending a message to peerd, it also reports this. Once a connection failure is detected, peerd shuts down. The taker's/connector's farcasterd then has to spawn a new connecting peerd. Swapd caches its failed peerd requests and fires them off again once farcasterd reports back to it that the peerd connection might be running again.
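A sketch of that caching-and-replay behaviour with simplified stand-in types (PendingMsg and the plain queue are illustrative, not what swapd actually stores):

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone)]
struct PendingMsg(String);

struct Swapd {
    pending: VecDeque<PendingMsg>,
    peerd_up: bool,
}

impl Swapd {
    fn send_to_peerd(&mut self, msg: PendingMsg) {
        if self.peerd_up {
            println!("sent to peerd: {:?}", msg);
        } else {
            // Connection is down: cache the request instead of losing it.
            self.pending.push_back(msg);
        }
    }

    // Called when farcasterd reports the peerd connection might be up again.
    fn on_peerd_reconnected(&mut self) {
        self.peerd_up = true;
        while let Some(msg) = self.pending.pop_front() {
            println!("replayed to peerd: {:?}", msg);
        }
    }
}

fn main() {
    let mut swapd = Swapd { pending: VecDeque::new(), peerd_up: false };
    swapd.send_to_peerd(PendingMsg("BuyProcSig".into()));
    swapd.on_peerd_reconnected();
}
```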
Logic to handle multiple re-connection attempts with a back-off period has not been implemented yet.
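For reference, one common shape such back-off logic could take (not implemented in this PR): an exponentially growing delay between reconnect attempts, capped at a maximum. try_reconnect is a hypothetical placeholder for the real connection attempt.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical placeholder: pretend the third attempt succeeds.
fn try_reconnect(attempt: u32) -> bool {
    attempt >= 3
}

fn main() {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    for attempt in 1u32.. {
        if try_reconnect(attempt) {
            println!("reconnected on attempt {}", attempt);
            break;
        }
        println!("attempt {} failed, retrying in {:?}", attempt, delay);
        thread::sleep(delay);
        // Double the delay each round, up to the cap.
        delay = (delay * 2).min(max_delay);
    }
}
```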
I am having some trouble testing this cleanly. Ideally I would like to simulate the connection being dropped, for example by an internet outage. @Lederstrumpf @h4sh3d do you have suggestions for simulating this, for example with something like low-level port manipulation?
EDIT: I added a test where I just kill peerd, but this has the effect that one side does not handle the disconnect cleanly.