Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple endpoints and receive data by train stride #70

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

philsmt
Copy link

@philsmt philsmt commented Oct 19, 2023

Client part of https://git.xfel.eu/karaboDevices/TrainMatcher/-/merge_requests/42

Adds support to select to only receive certain trains via a divisor and remainder, as well as the DEALER pattern to connect to multiple endpoints, each sending part of a train's data, and match the data afterwards. The matching part is not robust against the bridges going out sync at this point (e.g. asking for even trains may not yield the same result across two bridges). The current plan is to see in operations how much of a problem this is.

@philsmt philsmt requested a review from tmichela October 19, 2023 14:47
@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Files Coverage Δ
karabo_bridge/client.py 86.15% <80.00%> (-6.87%) ⬇️

📢 Thoughts on this report? Let us know!.

@takluyver
Copy link
Member

I think you're already aware how this could go wrong if the senders aren't sufficiently aligned, especially if one of them skips over the key train we're waiting for - it will wait for divisor trains to get the next matching one.

But I think it's also going to get very confusing if one of the addresses you give it is wrong - either a typo, or a sender that's not running. Because ZMQ allows you to connect() before the other side calls bind(), connect() always succeeds, and it tries to establish the connection in the background. So you send out N requests but only have M<N connected peers; ZMQ either sends two requests to the same peer, or else queues up messages for 'to be connected' peers. Of course, making a mistake is already a waiting game with a single sender, but I suspect it's going to be both easier to do and more confusing with multiple.

If we focus less on making something similar to the existing Karabo-bridge protocol, what might a solution look like? For instance:

  • Client connects to a REP-REQ socket, sends a request for a 'modulo 5' channel
    • This can have a fairly short timeout so you get a failure fast if there's no reply
  • Sender creates a PUB socket for 'modulo 5' (if it doesn't already have one), replies with an address to connect to
    • The request could also go to a coordinator which talks to N senders and replies with N addresses
  • Sender starts sending out messages with the remainder (mod 5) as a topic
  • Client creates a SUB socket, connects to the address it got, and subscribes to its desired remainder

I'm not saying we should do it exactly like that, but I don't think the reply-with-data pattern is a great starting point, and I don't see much benefit of sticking to it given that we have to change both sides anyway.

@philsmt
Copy link
Author

philsmt commented Oct 23, 2023

Thanks for having a look. I want to stress that this concept tries to enhance the current semantics without being overly intrusive at all. After all, the N senders to M train-parallelized clients is still a bit of a niche pattern.

I think you're already aware how this could go wrong if the senders aren't sufficiently aligned, especially if one of them skips over the key train we're waiting for - it will wait for divisor trains to get the next matching one.
But I think it's also going to get very confusing if one of the addresses you give it is wrong - either a typo, or a sender that's not running. Because ZMQ allows you to connect() before the other side calls bind(), connect() always succeeds, and it tries to establish the connection in the background. So you send out N requests but only have M<N connected peers; ZMQ either sends two requests to the same peer, or else queues up messages for 'to be connected' peers. Of course, making a mistake is already a waiting game with a single sender, but I suspect it's going to be both easier to do and more confusing with multiple.

Yes, the matching part on the client side is extraordinary naive at the moment. It definitely needs a timeout to cope with missing bridges, and a check whether the train IDs actually match up. Frankly I'm still unsure how much of a problem misaligned bridges will be. Amusingly, it will become more stable the more clients with a higher divisor are connected.

If we focus less on making something similar to the existing Karabo-bridge protocol, what might a solution look like? For instance:

  • Client connects to a REP-REQ socket, sends a request for a 'modulo 5' channel

    • This can have a fairly short timeout so you get a failure fast if there's no reply
  • Sender creates a PUB socket for 'modulo 5' (if it doesn't already have one), replies with an address to connect to

    • The request could also go to a coordinator which talks to N senders and replies with N addresses
  • Sender starts sending out messages with the remainder (mod 5) as a topic

  • Client creates a SUB socket, connects to the address it got, and subscribes to its desired remainder

I considered the PUB-SUB idea first and found the bridge-side setting of it a major disadvantage. It's indeed an interesting idea to make the configuration part of the protocol. However, in the end we may also simply create additional bridges for such cases (and leave a default one intact, remember TrainMatchers can have multiple bridges already) to avoid handling teardown of such sockets as well.

I'm not saying we should do it exactly like that, but I don't think the reply-with-data pattern is a great starting point, and I don't see much benefit of sticking to it given that we have to change both sides anyway.

Agree, there are much better designs possible with deeper changes. As you suggest, I would actually start with the patterns on both ends, replacing the current push/poll concept with an actual loop involving control messages. This would significantly reduce the risk for the "simple protocol" of this PR to get out of sync, and allow even deeper alignment based on actual train IDs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants