
Decentralize parachain relayer #1265

Closed

wants to merge 17 commits

Conversation

Contributor
@yrong yrong commented Aug 8, 2024

@yrong yrong marked this pull request as ready for review August 9, 2024 01:39
err = li.subscribeNewMMRRoots(ctx)
if err != nil {
if errors.Is(err, context.Canceled) {
err = li.doScan(ctx, beefyBlockNumber)
Contributor Author

Some cleanup here: remove subscribeNewMMRRoots.

The reason is that we are switching to a multiple-relayer context, with tasks scheduled in a round-robin manner. When a relayer is down, subscribeNewMMRRoots does not help avoid all relayers competing for the failed message at the same time.

A periodic scan with some randomness seems to make more sense here: in each round only one message is processed by a relayer, and a message is submitted anyway if it times out without anyone picking it up.

Collaborator

not sure i understand this - surely all relayers should still subscribe to getting new mmr roots, even if they're not actively going to send a message proof for a given message?

Contributor Author
@yrong yrong Aug 9, 2024

all relayers competing for the same message at the same time in case when a relayer is down

Mainly to address the issue above. And I think subscribing is not required in any case; we also do a periodic scan for beacon.

The pattern here is that in each round only one message will be processed by one relayer, which then sleeps for some time until the next round, essentially giving the other relayers a chance and avoiding the race condition.
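
To make the pattern concrete, here is a minimal sketch of the per-round decision described above (the function, callback, and config names are illustrative, not the PR's actual API):

```go
package relay

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"
)

// ScheduleConfig is illustrative; the PR's actual config only carries `id` and `num`.
type ScheduleConfig struct {
	ID      uint64        // this relayer's index
	Num     uint64        // total number of relayers
	Timeout time.Duration // how long to wait before covering another relayer's nonce
}

// processNonce sketches one round: relay immediately if the nonce is assigned
// to this relayer, otherwise wait for the timeout and relay only if nobody
// else has picked it up in the meantime.
func processNonce(
	ctx context.Context,
	cfg ScheduleConfig,
	nonce uint64,
	isRelayed func(context.Context, uint64) (bool, error), // e.g. check the Gateway's latest delivered nonce
	relay func(context.Context, uint64) error, // e.g. submit the message proof
) error {
	if nonce%cfg.Num != cfg.ID {
		log.WithField("nonce", nonce).Info("waiting for nonce to be picked up by another relayer")
		select {
		case <-time.After(cfg.Timeout):
		case <-ctx.Done():
			return ctx.Err()
		}
		done, err := isRelayed(ctx, nonce)
		if err != nil {
			return err
		}
		if done {
			log.WithField("nonce", nonce).Info("nonce picked up by another relayer, just skip")
			return nil
		}
	}
	// Either self-assigned or timed out without anyone else relaying it.
	return relay(ctx, nonce)
}
```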

Collaborator
@vgeddes vgeddes Aug 12, 2024

I am open to this change, but I think we need more upfront design and analysis before writing more code. I'm worried that introducing polling & randomness could result in undefined behavior.

Specifically, I think the following would help:

  1. In high-level pseudocode, a step-by-step analysis showing how a relayer would:
  • Detect new messages
  • Make sure there is an MMR root which can be used to generate proofs
  • Back off if it is not the designated relayer for a message
  2. A theoretical simulation in which there are three different relayers, and a subset of them are offline.

Contributor
@alistair-singh alistair-singh Aug 13, 2024

I think instead of randomness we should have primary and secondary slots, or N many slots.

A relayer would calculate its slots as follows for a message nonce:
Primary slot: nonce % numRelayers == relayerId
Secondary slot: nonce % numRelayers - 1 == relayerId
Tertiary slot: nonce % numRelayers - 2 == relayerId
... etc, for as many slots as there are relayers

If a nonce lands on the primary slot for a relayer, it relays immediately.
If a nonce lands on the secondary slot, it sleeps 3 minutes, checks if the nonce has been relayed, and if not relays it.
If a nonce lands on the tertiary slot, it sleeps 6 minutes, checks if the nonce has been relayed, and if not relays it.
... etc, for as many slots as there are relayers.

This is deterministic, and every relayer covers every other relayer's nonce at least once in a given timeframe. If there is a misbehaving relayer we can easily identify who is missing their slots.

The code can be quite simple as well:

slot = (nonce % numRelayers) - relayerId

Then you would always sleep for slot*3 minutes before relaying the message. When slot == 0 the sleep is 0 minutes, i.e. relay immediately. Before relaying the nonce we would also check whether it has already been relayed, which will never be the case for slot 0.
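
A minimal sketch of that calculation (names are illustrative, and the subtraction is assumed to wrap modulo numRelayers so slots stay non-negative):

```go
package relay

import "time"

// slotFor computes this relayer's slot for a nonce: 0 = primary, 1 = secondary,
// and so on. The subtraction is wrapped modulo numRelayers so the result stays
// in [0, numRelayers) rather than going negative.
func slotFor(nonce, numRelayers, relayerID uint64) uint64 {
	return ((nonce % numRelayers) + numRelayers - relayerID) % numRelayers
}

// backoff returns how long to sleep before attempting the nonce: slot*3 minutes,
// so the primary (slot 0) relays immediately.
func backoff(nonce, numRelayers, relayerID uint64) time.Duration {
	return time.Duration(slotFor(nonce, numRelayers, relayerID)) * 3 * time.Minute
}
```

For example, with 3 relayers and nonce 4: relayer 1 gets slot 0 (relay immediately), relayer 0 gets slot 1 (wait 3 minutes), and relayer 2 gets slot 2 (wait 6 minutes).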

Contributor Author
@yrong yrong Aug 13, 2024

Cool! Though I still have a concern: does that mean the wait time is linearly correlated with the number of relayers? So if we deployed 5 relayers, in the worst case one relayer could wait 15 min, and with 10 relayers, 30 min. Is that waiting time too long?

Or is there any significant drawback to the current impl (i.e. random polling + fixed waiting)? IMHO it's not complicated, and the relayers also cover each other, as the tests demonstrate.

Contributor Author
@yrong yrong Aug 14, 2024

@alistair-singh I think your idea is good; the concern above (e.g. waiting too long when there is a large number of relayers and most of them are down) is very rare and may not be a real problem.

So I added another PR in #1266 following the above design; please review and check whether it makes sense.

IMO the current implementation (i.e. random polling + fixed waiting) is also not bad: it's very easy to replicate to the execution relayer for the reverse direction, as there is only polling support there. With the refactoring in this PR (i.e. removing the subscription) we share the same pattern for both directions.

I'll let the team make the final decision.

@yrong yrong requested a review from musnit August 9, 2024 05:46
@acatangiu

hey, so we built some machinery for decentralizing relayers which could be reused by you too - not yet deployed at bridge hub: paritytech/parity-bridges-common#2605

Instead of the bridge posting a reward and having relayers go round-robin, there is instead a bidding round and relayer with the lowest bid (cheapest relayer) wins. Worth exploring.

@musnit
Collaborator

musnit commented Aug 9, 2024

hey, so we built some machinery for decentralizing relayers which could be reused by you too - not yet deployed at bridge hub: paritytech/parity-bridges-common#2605

Instead of the bridge posting a reward and having relayers go round-robin, there is instead a bidding round and relayer with the lowest bid (cheapest relayer) wins. Worth exploring.

oh cool - do you have an idea of eta and scope of effort to integrate with this (including offchain code needed)?

Our idea with this current PR to decentralize via round robin is that it's something we can ship and deploy in < a week for short term use.

@acatangiu

Our idea with this current PR to decentralize via round robin is that it's something we can ship and deploy in < a week for short term use.

Yes, the bidding mechanism can be done after, it should also go through an audit.

@yrong
Contributor Author

yrong commented Aug 11, 2024

a bidding round and relayer with the lowest bid (cheapest relayer) wins.

Interesting, sounds more like a fee market based on a Dutch auction? Do you have an existing PR for that somewhere?

IIUC your bridge depends on some on-chain logic with a relayers queue and a signed extension to

  • reject obsolete messages from another relayer here
  • add an additional priority boost for the relayer assigned to that slot here

which helps to resolve the race condition mentioned above, correct me if I'm wrong.

As Aidan mentioned, what we're going to release is a short-term, purely offchain improvement with acceptable tradeoffs for messages to Ethereum, so we won't introduce any on-chain logic/extension here.

But we'll definitely take that into consideration for the long run. Thanks again!

@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
Collaborator

The naming should be more specific. "Relayer" is vague.

I.e. something like ScheduleConfig.


Sink SinkConfig `mapstructure:"sink"`
Source SourceConfig `mapstructure:"source"`
Sink SinkConfig `mapstructure:"sink"`
Relay RelayerConfig `mapstructure:"relay"`
Collaborator

Suggest more specific naming.

Suggested change
Relay RelayerConfig `mapstructure:"relay"`
Schedule ScheduleConfig `mapstructure:"schedule"`


@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
ID uint64 `mapstructure:"id"`
Num uint64 `mapstructure:"num"`
Collaborator

Comments needed for these fields please.
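
For instance (wording assumed from how the fields are used in the config examples further below), something like:

```go
type RelayerConfig struct {
	// ID is this relayer's index in the round-robin schedule (0-based, unique per relayer).
	ID uint64 `mapstructure:"id"`
	// Num is the total number of relayers participating in the schedule.
	Num uint64 `mapstructure:"num"`
}
```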



@yrong
Contributor Author

yrong commented Aug 14, 2024

Why polling

The main reason to use polling rather than subscribing is, as mentioned here #1265 (comment), to avoid the race condition.

Say we've deployed 5 relayers and 1 is down. With subscribing, I would assume all the other relayers compete for the same message assigned to that failed relayer, and only one submission is accepted on chain (the others rejected by InvalidNonce). Considering the cost of submitV1 in the Gateway contract is 3 dollars, that's 3*(5-1-1)=9 dollars wasted, which we'd like to avoid.

Timeout Analysis

Take a specific transfer from production as an example, from the following logs:

`"message":"{\"@timestamp\":\"2024-08-12T16:04:27.679804405Z\",\"level\":\"info\",\"message\":\"Nonces are mismatched, scanning for commitments that need to be relayed\"}"`

`"message":"{\"@timestamp\":\"2024-08-12T16:05:16.249771481Z\",\"level\":\"info\",\"message\":\"Beefy Listener emitted new task\"}"`

`"message":"{\"@timestamp\":\"2024-08-12T16:05:42.844767874Z\",\"channelID\":\"0x0000000000000000000000000000000000000000000000000000000000000000\",\"level\":\"info\",\"message\":\"Message dispatched\",\"nonce\":12,\"success\":true}"`

As we can see, it takes about 50 seconds to scan for the message and 25 seconds for submitV1 to the Gateway and receiving the receipt, so about 75 seconds in total for a sending round to finish.

So we can wait 60 seconds for the task to time out (i.e. enough to cover the timeout for submitV1) and set the polling period to 90+rand(30) seconds (i.e. enough to cover the whole round).
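
As a rough sketch of what those numbers translate to in the listener loop (a hedged illustration, not this PR's exact code; doScan stands in for the existing scan routine):

```go
package relay

import (
	"context"
	"math/rand"
	"time"
)

// Timings derived from the analysis above; the names are illustrative.
const (
	taskTimeout = 60 * time.Second // wait before covering a nonce assigned to another relayer
	basePeriod  = 90 * time.Second // covers a full sending round
	maxJitter   = 30 * time.Second // randomness so relayers don't scan in lockstep
)

// pollLoop runs doScan periodically with a randomized period of 90+rand(30) seconds.
func pollLoop(ctx context.Context, doScan func(context.Context) error) error {
	for {
		if err := doScan(ctx); err != nil {
			return err
		}
		period := basePeriod + time.Duration(rand.Int63n(int64(maxJitter)))
		select {
		case <-time.After(period):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```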

@yrong
Contributor Author

yrong commented Aug 14, 2024

Tests with a setup of 3 parachain relayers

Configuration

Relayer 0:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-0.json --ethereum.private-key **

{
"id": 0,
"num": 3
}

Relayer 1:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-1.json --ethereum.private-key **

{
"id": 1,
"num": 3
}

Relayer 2:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-2.json --ethereum.private-key **

{
"id": 2,
"num": 3
}

Test Flow

  1. Run the smoke test first to register WETH with
cargo test --test register_token -- --nocapture
  2. Send token to AH with
cargo test --test send_token -- --nocapture
  3. Run transfer_token to send token to Ethereum for the first time (nonce: 1)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 1 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:03:00.889538+08:00","level":"info","message":"waiting for nonce 1 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:03:25.90941+08:00","level":"info","message":"nonce 1 picked up by another relayer, just skip"}
..
{"@timestamp":"2024-08-13T10:05:04.916829+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}

Relayer 1:

{"@timestamp":"2024-08-13T10:03:24.978745+08:00","level":"info","message":"nonce 1 self assigned to relay(1)"}
..
{"@timestamp":"2024-08-13T10:03:30.992641+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":1,"success":true}

Relayer 2:

{"@timestamp":"2024-08-13T10:03:36.398372+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}
  4. Run transfer_token to send token to Ethereum for the second time (nonce: 2)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:12:19.980181+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:45.000143+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}

Relayer 1:

{"@timestamp":"2024-08-13T10:12:19.037005+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:44.04805+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}

Relayer 2:

{"@timestamp":"2024-08-13T10:12:40.46347+08:00","level":"info","message":"nonce 2 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:12:46.474835+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":2,"success":true}
  5. Shut down Relayer 0, then run transfer_token to send token to Ethereum for the third time (nonce: 3)
cargo test --test transfer_token -- --nocapture

From the logs below, it's first spotted and then picked up by Relayer 1 after the timeout.

Relayer 1:

{"@timestamp":"2024-08-13T10:19:44.079279+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:20:49.099806+08:00","level":"info","message":"nonce 3 timeout but picked up by relay(1)"}

{"@timestamp":"2024-08-13T10:20:55.106054+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":3,"success":true}

Relayer 2:

{"@timestamp":"2024-08-13T10:19:53.521488+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:20:53.559145+08:00","level":"info","message":"nonce 3 picked up by another relayer, just skip"}
  6. Shut down Relayer 1, then run transfer_token to send token to Ethereum for the fourth time (nonce: 4)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 after the timeout.

Relayer 2:

{"@timestamp":"2024-08-13T10:31:18.578844+08:00","level":"info","message":"waiting for nonce 4 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:32:23.613604+08:00","level":"info","message":"Beefy Listener emitted new task"}
{"@timestamp":"2024-08-13T10:32:23.613623+08:00","level":"info","message":"nonce 4 timeout but picked up by relay(2)"}

{"@timestamp":"2024-08-13T10:32:29.627741+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":4,"success":true}
  7. Start Relayer 0 and Relayer 1 again, then run transfer_token to send token to Ethereum for the fifth time (nonce: 5)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:40:42.545415+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:27.568863+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}

Relayer 1:

{"@timestamp":"2024-08-13T10:40:46.318362+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:26.345597+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}

Relayer 2:

{"@timestamp":"2024-08-13T10:41:22.655163+08:00","level":"info","message":"nonce 5 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:41:28.662386+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":5,"success":true}
  8. Stop all relayers, run transfer_token to send tokens to Ethereum multiple times (nonces 6 to 9), wait for all messages to time out, and then start all relayers again.

From the logs below, the nonces are evenly distributed across the relayers.

Relayer 0 picks up nonce 6 and nonce 9:

{"@timestamp":"2024-08-13T10:56:02.711806+08:00","level":"info","message":"nonce 6 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T10:56:08.728365+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":6,"success":true}
...
{"@timestamp":"2024-08-13T11:10:28.110246+08:00","level":"info","message":"waiting for nonce 7 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T11:10:33.112358+08:00","level":"info","message":"nonce 7 picked up by another relayer, just skip"}
...
{"@timestamp":"2024-08-13T11:12:04.115559+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
...
{"@timestamp":"2024-08-13T11:12:04.140457+08:00","level":"info","message":"nonce 9 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T11:12:10.148195+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":9,"success":true}

Relayer 1 picks up nonce 7:

{"@timestamp":"2024-08-13T11:10:30.161278+08:00","level":"info","message":"nonce 7 self assigned to relay(1)"}
{"@timestamp":"2024-08-13T11:10:36.169644+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":7,"success":true}
...
{"@timestamp":"2024-08-13T11:12:01.170838+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
{"@timestamp":"2024-08-13T11:12:01.171389+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}

Relayer 2 picks up nonce 8:

{"@timestamp":"2024-08-13T11:10:33.094379+08:00","level":"info","message":"nonce 8 self assigned to relay(2)"}
{"@timestamp":"2024-08-13T11:10:39.101708+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":8,"success":true}
...
{"@timestamp":"2024-08-13T11:13:51.142968+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}

@yrong
Contributor Author

yrong commented Aug 14, 2024

Closed in favor of #1266.

@yrong yrong closed this Aug 14, 2024