Decentralize parachain relayer #1265
Conversation
```
err = li.subscribeNewMMRRoots(ctx)
if err != nil {
	if errors.Is(err, context.Canceled) {
err = li.doScan(ctx, beefyBlockNumber)
```
Some cleanup here: remove subscribeNewMMRRoots.
The reason is that we are switching to a multiple-relayers context with tasks scheduled in a round-robin manner. When a relayer is down, subscribeNewMMRRoots does not help to avoid all relayers competing for the failed message at the same time.
A periodic scan with some randomness seems to make more sense here: each round only one message is processed by a relayer, and if that message times out without anyone picking it up, submit it anyway.
Not sure I understand this - surely all relayers should still subscribe to new MMR roots, even if they're not actively going to send a message proof for a given message?
All relayers would compete for the same message at the same time when a relayer is down.
This change is mainly to address the issue above. I also think subscribing is not required in any case; we do a periodic scan for the beacon relayer as well.
The pattern here is that in each round only one message is processed by one relayer, which then sleeps for some time until the next round, essentially giving the other relayers a chance and avoiding the race condition.
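For concreteness, here is a minimal Go sketch of the round-based polling pattern described above, assuming a fixed base interval plus random jitter; the `Task`, `findUndeliveredTask`, and `processTask` names are hypothetical stand-ins, not code from this PR.

```go
package relay

import (
	"context"
	"math/rand"
	"time"
)

// Task and the helpers below are hypothetical stand-ins for the relayer's
// real scanning and submission logic; they are not from this PR.
type Task struct{ Nonce uint64 }

func findUndeliveredTask(ctx context.Context) (*Task, error) { return nil, nil }
func processTask(ctx context.Context, t *Task) error         { return nil }

// pollLoop processes at most one message per round, then sleeps for a fixed
// interval plus random jitter so relayers do not wake up in lock-step.
func pollLoop(ctx context.Context) error {
	for {
		task, err := findUndeliveredTask(ctx)
		if err != nil {
			return err
		}
		if task != nil {
			if err := processTask(ctx, task); err != nil {
				return err
			}
		}
		// Fixed wait plus jitter reduces the chance of several relayers
		// racing for the same message at the same time.
		jitter := time.Duration(rand.Intn(30)) * time.Second
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2*time.Minute + jitter):
		}
	}
}
```

The interval and jitter bounds here are placeholders; the point is only that each relayer handles at most one message per round and then yields.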
I am open to this change, but I think we need more upfront design and analysis before writing more code. I'm worried that introducing polling & randomness could result in undefined behavior.
Specifically, I think the following would help:
- In high-level pseudocode, a step-by-step analysis showing how a relayer would:
  - Detect new messages
  - Make sure there is an MMR root which can be used to generate proofs
  - Back off if it is not a designated relayer for a message
- A theoretical simulation in which there are three different relayers, and a subset of them are offline.
I think instead of randomness we should have primary and secondary slots, or N many slots.
A relayer would calculate its slots as follows for a message nonce:
Primary slot: nonce % numRelayers == relayerId
Secondary slot: nonce % numRelayers - 1 == relayerId
Tertiary slot: nonce % numRelayers - 2 == relayerId
... etc, for as many slots as there are relayers
If a nonce lands on the primary slot for a relayer, it relays immediately.
If a nonce lands on the secondary slot, it sleeps 3 minutes, checks if the nonce is relayed, and if not relays it.
If a nonce lands on the tertiary slot, it sleeps 6 minutes, checks if the nonce is relayed, and if not relays it.
... etc, for as many slots as there are relayers.
This is deterministic, and every relayer covers every other relayer's nonces at least once in a given timeframe. If there is a misbehaving relayer we can easily identify who is missing their slots.
The code can be quite simple as well:
slot = (nonce % numRelayers) - relayerId
Then you would always sleep for slot*3 minutes before relaying the message. When slot == 0 the sleep would be 0 minutes, i.e. relay immediately. And before relaying the nonce we would add a check that it has not been relayed yet, which will always be the case for slot 0.
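A minimal Go sketch of this slot calculation; the extra `+ numRelayers` and final modulo are my addition to keep the result non-negative, assuming `relayerID < numRelayers`.

```go
package relay

import "time"

// slotFor computes which slot this relayer occupies for a given nonce.
// Slot 0 is the primary relayer for the nonce; slot k acts as the k-th backup.
// The "+ numRelayers" and final modulo are an assumption on top of the formula
// above, so the result never underflows (requires relayerID < numRelayers).
func slotFor(nonce, numRelayers, relayerID uint64) uint64 {
	return (nonce%numRelayers + numRelayers - relayerID) % numRelayers
}

// delayFor returns how long to wait before checking whether the nonce still
// needs to be relayed; the primary slot relays immediately.
func delayFor(nonce, numRelayers, relayerID uint64) time.Duration {
	return time.Duration(3*slotFor(nonce, numRelayers, relayerID)) * time.Minute
}
```

For example, with 3 relayers, nonce 7 gives slot 0 for relayer 1, slot 1 for relayer 0, and slot 2 for relayer 2, so every nonce has exactly one immediate relayer and an ordered chain of backups.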
Cool! Though there is still a concern in my mind: does that mean the wait time will be linearly correlated with the number of relayers? So if we deployed 5 relayers, in the worst case one relayer could wait 15 minutes, and with 10 relayers 30 minutes. Is that waiting time too long?
Or is there any significant drawback to the current impl (i.e. random polling + fixed waiting)? IMHO it's not complicated, and relayers also cover each other, as the tests demonstrated.
@alistair-singh I think your idea is good; the concern above (e.g. waiting too long when there is a large number of relayers and most of them are down) is very rare and may not be a real problem.
So I added another PR in #1266 following the above design; please review and check if it makes sense.
IMO the current implementation (i.e. random polling + fixed waiting) is also not bad: it's very easy to replicate for the execution relayer in the reverse direction, as there is only polling support there. With the refactoring in this PR (i.e. removing subscribing) we share the same pattern for both directions.
I'll let the team make the final decision.
Hey, so we built some machinery for decentralizing relayers which could be reused by you too - not yet deployed at Bridge Hub: paritytech/parity-bridges-common#2605. Instead of the bridge posting a reward and having relayers go round-robin, there is a bidding round and the relayer with the lowest bid (cheapest relayer) wins. Worth exploring.

Oh cool - do you have an idea of the ETA and scope of effort to integrate with this (including the offchain code needed)? Our idea with this current PR to decentralize via round robin is that it's something we can ship and deploy in < a week for short-term use.

Yes, the bidding mechanism can be done after; it should also go through an audit.

Interesting, sounds more like a fee market based on a Dutch auction? Do you have an existing PR for that somewhere? IIUC your bridge depends on some on-chain logic with a relayers queue and a signed extension, which helps to resolve the race condition mentioned above - correct me if I'm wrong. As Aidan mentioned, what we're going to release is a short-term, purely offchain improvement with an acceptable tradeoff for messages to Ethereum, so we won't introduce any on-chain logic/extension here. But we'll definitely take that into consideration for the long run. Thanks again!
relayer/relays/parachain/config.go
Outdated
```
@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
	Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
```
The naming should be more specific; "Relayer" is vague.
I.e. something like ScheduleConfig.
relayer/relays/parachain/config.go
Outdated
```
Source SourceConfig `mapstructure:"source"`
Sink SinkConfig `mapstructure:"sink"`
Relay RelayerConfig `mapstructure:"relay"`
```
Suggest more specific naming.
```diff
-Relay RelayerConfig `mapstructure:"relay"`
+Schedule ScheduleConfig `mapstructure:"schedule"`
```
```
@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
	Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
	ID uint64 `mapstructure:"id"`
	Num uint64 `mapstructure:"num"`
```
Comments are needed for these fields, please.
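For illustration, a documented version might look roughly like this, using the `ScheduleConfig` name suggested above; the field comments are my reading of the fields from this conversation, not wording from the PR.

```go
package parachain

// ScheduleConfig controls how message nonces are scheduled across the
// deployed relayers (name per the review suggestion above).
type ScheduleConfig struct {
	// ID is this relayer's index, expected to be in the range [0, Num).
	ID uint64 `mapstructure:"id"`
	// Num is the total number of relayers deployed for this channel.
	Num uint64 `mapstructure:"num"`
}
```

With the suggested `mapstructure:"schedule"` tag on the parent config, these fields would map from an `id`/`num` pair in each relayer's config file.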
### Why polling

The main reason to use polling rather than subscribing is, as mentioned here in #1265 (comment), to avoid a race condition. Consider we've deployed 5 relayers and 1 is down: with subscribing I would assume all the other relayers will compete for the same message assigned to that failed relayer, and only one will be accepted on chain (the others rejected).

### Timeout Analysis

Take a specific transfer from production for example, from the logs as follows:

As we can see it's about 50 seconds for scanning the message and 25 seconds for …, so we can wait 60 seconds for the task to time out (i.e. enough to cover the timeout for …).
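A rough Go sketch of the timeout fallback described above, with the 60-second grace period taken from the timing analysis; `latestDeliveredNonce` and `submit` are hypothetical helpers, not the PR's actual functions.

```go
package relay

import (
	"context"
	"time"
)

// Hypothetical helpers standing in for the relayer's real on-chain queries
// and submission logic; they are not from this PR.
func latestDeliveredNonce(ctx context.Context) (uint64, error) { return 0, nil }
func submit(ctx context.Context, nonce uint64) error           { return nil }

// relayWithTimeout relays a nonce immediately if it is assigned to this
// relayer; otherwise it waits a ~60 second grace period (the figure from the
// timing analysis above) and submits only if the nonce is still undelivered.
func relayWithTimeout(ctx context.Context, nonce uint64, assignedToMe bool) error {
	if !assignedToMe {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(60 * time.Second):
		}
		delivered, err := latestDeliveredNonce(ctx)
		if err != nil {
			return err
		}
		if delivered >= nonce {
			// Another relayer picked it up; nothing to do.
			return nil
		}
	}
	return submit(ctx, nonce)
}
```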
### Tests with 3 parachain relayers setup

#### Configuration

Relayer 0:

```
./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-0.json --ethereum.private-key **
```

Relayer 1:

```
./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-1.json --ethereum.private-key **
```

Relayer 2:

```
./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-2.json --ethereum.private-key **
```

#### Test Flow

```
cargo test --test register_token -- --nocapture
cargo test --test send_token -- --nocapture
cargo test --test transfer_token -- --nocapture
```

From the logs as follows it's picked up by Relayer 1 as expected.

Relayer 0:

```
{"@timestamp":"2024-08-13T10:03:00.889538+08:00","level":"info","message":"waiting for nonce 1 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:03:25.90941+08:00","level":"info","message":"nonce 1 picked up by another relayer, just skip"}
..
{"@timestamp":"2024-08-13T10:05:04.916829+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}
```

Relayer 1:

```
{"@timestamp":"2024-08-13T10:03:24.978745+08:00","level":"info","message":"nonce 1 self assigned to relay(1)"}
..
{"@timestamp":"2024-08-13T10:03:30.992641+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":1,"success":true}
```

Relayer 2:

```
{"@timestamp":"2024-08-13T10:03:36.398372+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}
```

```
cargo test --test transfer_token -- --nocapture
```

From the logs as follows it's picked up by Relayer 2 as expected.

Relayer 0:

```
{"@timestamp":"2024-08-13T10:12:19.980181+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:45.000143+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}
```

Relayer 1:

```
{"@timestamp":"2024-08-13T10:12:19.037005+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:44.04805+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}
```

Relayer 2:

```
{"@timestamp":"2024-08-13T10:12:40.46347+08:00","level":"info","message":"nonce 2 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:12:46.474835+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":2,"success":true}
```

```
cargo test --test transfer_token -- --nocapture
```

From the logs as follows it's first spotted and picked up by Relayer 1.

Relayer 1:

```
{"@timestamp":"2024-08-13T10:19:44.079279+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:20:49.099806+08:00","level":"info","message":"nonce 3 timeout but picked up by relay(1)"}
{"@timestamp":"2024-08-13T10:20:55.106054+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":3,"success":true}
```

Relayer 2:

```
{"@timestamp":"2024-08-13T10:19:53.521488+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:20:53.559145+08:00","level":"info","message":"nonce 3 picked up by another relayer, just skip"}
```

```
cargo test --test transfer_token -- --nocapture
```

From the logs as follows it's picked up by Relayer 2.

Relayer 2:

```
{"@timestamp":"2024-08-13T10:31:18.578844+08:00","level":"info","message":"waiting for nonce 4 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:32:23.613604+08:00","level":"info","message":"Beefy Listener emitted new task"}
{"@timestamp":"2024-08-13T10:32:23.613623+08:00","level":"info","message":"nonce 4 timeout but picked up by relay(2)"}
{"@timestamp":"2024-08-13T10:32:29.627741+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":4,"success":true}
```

```
cargo test --test transfer_token -- --nocapture
```

From the logs as follows it's picked up by Relayer 2 as expected.

Relay 0:

```
{"@timestamp":"2024-08-13T10:40:42.545415+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:27.568863+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}
```

Relay 1:

```
{"@timestamp":"2024-08-13T10:40:46.318362+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:26.345597+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}
```

Relay 2:

```
{"@timestamp":"2024-08-13T10:41:22.655163+08:00","level":"info","message":"nonce 5 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:41:28.662386+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":5,"success":true}
```

From the logs as follows, nonces are evenly distributed to each relayer.

Relay 0 picks up nonce 6 and nonce 9:

```
{"@timestamp":"2024-08-13T10:56:02.711806+08:00","level":"info","message":"nonce 6 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T10:56:08.728365+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":6,"success":true}
...
{"@timestamp":"2024-08-13T11:10:28.110246+08:00","level":"info","message":"waiting for nonce 7 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T11:10:33.112358+08:00","level":"info","message":"nonce 7 picked up by another relayer, just skip"}
...
{"@timestamp":"2024-08-13T11:12:04.115559+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
...
{"@timestamp":"2024-08-13T11:12:04.140457+08:00","level":"info","message":"nonce 9 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T11:12:10.148195+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":9,"success":true}
```

Relay 1 picks up nonce 7:

```
{"@timestamp":"2024-08-13T11:10:30.161278+08:00","level":"info","message":"nonce 7 self assigned to relay(1)"}
{"@timestamp":"2024-08-13T11:10:36.169644+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":7,"success":true}
...
{"@timestamp":"2024-08-13T11:12:01.170838+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
{"@timestamp":"2024-08-13T11:12:01.171389+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}
```

Relay 2 picks up nonce 8:

```
{"@timestamp":"2024-08-13T11:10:33.094379+08:00","level":"info","message":"nonce 8 self assigned to relay(2)"}
{"@timestamp":"2024-08-13T11:10:39.101708+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":8,"success":true}
...
{"@timestamp":"2024-08-13T11:13:51.142968+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}
```
Closing in favor of #1266.
Resolves: https://linear.app/snowfork/issue/SNO-1118