
Decentralize parachain relayer #1265

Closed

wants to merge 17 commits

Conversation

Contributor
@yrong yrong commented Aug 8, 2024

@yrong yrong marked this pull request as ready for review August 9, 2024 01:39
err = li.subscribeNewMMRRoots(ctx)
if err != nil {
if errors.Is(err, context.Canceled) {
err = li.doScan(ctx, beefyBlockNumber)
Contributor Author

Some cleanup here: remove subscribeNewMMRRoots.

The reason is that we are switching to a multiple-relayer context, with tasks scheduled in a round-robin manner. When a relayer is down, subscribeNewMMRRoots does not help avoid all relayers competing for the failed message at the same time.

A periodic scan with some randomness seems to make more sense here: in each round only one message is processed by a relayer, and a message is submitted anyway if it times out without anyone picking it up.

Collaborator

not sure i understand this - surely all relayers should still subscribe to getting new mmr roots, even if they're not actively going to send a message proof for a given message?

Contributor Author
@yrong yrong Aug 9, 2024

all relayers competing for the same message at the same time in case when a relayer is down

Mainly to address the issue above. And I think subscribing is not required in any case; we also do a periodic scan for beacon.

The pattern here is that in each round only one message will be processed by one relayer, which then sleeps for some time until the next round, essentially giving the other relayers a chance and avoiding the race condition.
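
To make the pattern concrete, here is a minimal sketch of the per-round decision described above (the function, callback, and config names are illustrative, not the PR's actual API):

```go
package relay

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"
)

// ScheduleConfig is illustrative; the PR's actual config only carries `id` and `num`.
type ScheduleConfig struct {
	ID      uint64        // this relayer's index
	Num     uint64        // total number of relayers
	Timeout time.Duration // how long to wait before covering another relayer's nonce
}

// processNonce sketches one round: relay immediately if the nonce is assigned
// to this relayer, otherwise wait for the timeout and relay only if nobody
// else has picked it up in the meantime.
func processNonce(
	ctx context.Context,
	cfg ScheduleConfig,
	nonce uint64,
	isRelayed func(context.Context, uint64) (bool, error), // e.g. check the Gateway's latest delivered nonce
	relay func(context.Context, uint64) error, // e.g. submit the message proof
) error {
	if nonce%cfg.Num != cfg.ID {
		log.WithField("nonce", nonce).Info("waiting for nonce to be picked up by another relayer")
		select {
		case <-time.After(cfg.Timeout):
		case <-ctx.Done():
			return ctx.Err()
		}
		done, err := isRelayed(ctx, nonce)
		if err != nil {
			return err
		}
		if done {
			log.WithField("nonce", nonce).Info("nonce picked up by another relayer, just skip")
			return nil
		}
	}
	// Either self-assigned or timed out without anyone else relaying it.
	return relay(ctx, nonce)
}
```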

Collaborator
@vgeddes vgeddes Aug 12, 2024

I am open to this change, but I think we need more upfront design and analysis before writing more code. I'm worried that introducing polling & randomness could result in undefined behavior.

Specifically, I think the following would help:

  1. In high-level pseudocode, a step-by-step analysis showing how a relayer would:
  • Detect new messages
  • Make sure there is an MMR root which can be used to generate proofs
  • Back off if it is not the designated relayer for a message
  2. A theoretical simulation in which there are three different relayers, and a subset of them are offline.

Contributor
@alistair-singh alistair-singh Aug 13, 2024

I think instead of randomness we should have primary and secondary slots, or N many slots.

A relayer would calculate its slots as follows for a message nonce:
Primary slot: nonce % numRelayers == relayerId
Secondary slot: nonce % numRelayers - 1 == relayerId
Tertiary slot: nonce % numRelayers - 2 == relayerId
... etc, for as many slots as there are relayers

If a nonce lands on the primary slot for a relayer, it relays immediately.
If a nonce lands on the secondary slot, it sleeps 3 minutes, checks if the nonce has been relayed, and if not relays it.
If a nonce lands on the tertiary slot, it sleeps 6 minutes, checks if the nonce has been relayed, and if not relays it.
... etc, for as many slots as there are relayers.

This is deterministic, and every relayer covers every other relayer's nonce at least once in a given timeframe. If there is a misbehaving relayer we can easily identify who is missing their slots.

The code can be quite simple as well:

slot = (nonce % numRelayers) - relayerId

Then you would always sleep for slot*3 minutes before relaying the message. When slot == 0 the sleep is 0 minutes, i.e. relay immediately. Before relaying the nonce we would also check whether it has already been relayed, which will never be the case for slot 0.
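
A minimal sketch of that calculation (names are illustrative, and the subtraction is assumed to wrap modulo numRelayers so slots stay non-negative):

```go
package relay

import "time"

// slotFor computes this relayer's slot for a nonce: 0 = primary, 1 = secondary,
// and so on. The subtraction is wrapped modulo numRelayers so the result stays
// in [0, numRelayers) rather than going negative.
func slotFor(nonce, numRelayers, relayerID uint64) uint64 {
	return ((nonce % numRelayers) + numRelayers - relayerID) % numRelayers
}

// backoff returns how long to sleep before attempting the nonce: slot*3 minutes,
// so the primary (slot 0) relays immediately.
func backoff(nonce, numRelayers, relayerID uint64) time.Duration {
	return time.Duration(slotFor(nonce, numRelayers, relayerID)) * 3 * time.Minute
}
```

For example, with 3 relayers and nonce 4: relayer 1 gets slot 0 (relay immediately), relayer 0 gets slot 1 (wait 3 minutes), and relayer 2 gets slot 2 (wait 6 minutes).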

Contributor Author
@yrong yrong Aug 13, 2024

Cool! Though I still have a concern: does that mean the wait time is linearly correlated with the number of relayers? So if we deployed 5 relayers, in the worst case one relayer could wait 15 min, and with 10 relayers, 30 min. Is that waiting time too long?

Or is there any significant drawback to the current impl (i.e. random polling + fixed waiting)? IMHO it's not complicated, and the relayers also cover each other, as the tests demonstrate.

Contributor Author
@yrong yrong Aug 14, 2024

@alistair-singh I think your idea is good; the concern above (e.g. waiting too long when there is a large number of relayers and most of them are down) is very rare and may not be a real problem.

So I added another PR in #1266 following the above design; please review and check whether it makes sense.

IMO the current implementation (i.e. random polling + fixed waiting) is also not bad: it's very easy to replicate to the execution relayer for the reverse direction, as there is only polling support there. With the refactoring in this PR (i.e. removing the subscription) we share the same pattern for both directions.

I'll let the team make the final decision.

@yrong yrong requested a review from musnit August 9, 2024 05:46
@acatangiu

hey, so we built some machinery for decentralizing relayers which could be reused by you too - not yet deployed at bridge hub: paritytech/parity-bridges-common#2605

Instead of the bridge posting a reward and having relayers go round-robin, there is instead a bidding round and relayer with the lowest bid (cheapest relayer) wins. Worth exploring.

@musnit
Collaborator

musnit commented Aug 9, 2024

hey, so we built some machinery for decentralizing relayers which could be reused by you too - not yet deployed at bridge hub: paritytech/parity-bridges-common#2605

Instead of the bridge posting a reward and having relayers go round-robin, there is instead a bidding round and relayer with the lowest bid (cheapest relayer) wins. Worth exploring.

oh cool - do you have an idea of eta and scope of effort to integrate with this (including offchain code needed)?

Our idea with this current PR to decentralize via round robin is that it's something we can ship and deploy in < a week for short term use.

@acatangiu

Our idea with this current PR to decentralize via round robin is that it's something we can ship and deploy in < a week for short term use.

Yes, the bidding mechanism can be done after, it should also go through an audit.

@yrong
Contributor Author

yrong commented Aug 11, 2024

a bidding round and relayer with the lowest bid (cheapest relayer) wins.

Interesting, sounds more like a fee market based on a Dutch auction? Do you have an existing PR for that somewhere?

IIUC your bridge depends on some on-chain logic with a relayers queue and a signed extension to

  • reject obsolete messages from another relayer here
  • add an additional priority boost for the relayer assigned to that slot here

which helps to resolve the race condition mentioned above, correct me if I'm wrong.

As Aidan mentioned, what we're going to release is a short-term, purely offchain improvement with acceptable tradeoffs for messages to Ethereum, so we won't introduce any on-chain logic/extension here.

But we'll definitely take that into consideration for the long run. Thanks again!

@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
Collaborator

The naming should be more specific. "Relayer" is vague.

I.e. something like ScheduleConfig.


Sink SinkConfig `mapstructure:"sink"`
Source SourceConfig `mapstructure:"source"`
Sink SinkConfig `mapstructure:"sink"`
Relay RelayerConfig `mapstructure:"relay"`
Collaborator

Suggest more specific naming.

Suggested change
Relay RelayerConfig `mapstructure:"relay"`
Schedule ScheduleConfig `mapstructure:"schedule"`


@@ -32,6 +35,18 @@ type SinkContractsConfig struct {
Gateway string `mapstructure:"Gateway"`
}

type RelayerConfig struct {
ID uint64 `mapstructure:"id"`
Num uint64 `mapstructure:"num"`
Collaborator

Comments needed for these fields please.
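
For instance (wording assumed from how the fields are used in the config examples further below), something like:

```go
type RelayerConfig struct {
	// ID is this relayer's index in the round-robin schedule (0-based, unique per relayer).
	ID uint64 `mapstructure:"id"`
	// Num is the total number of relayers participating in the schedule.
	Num uint64 `mapstructure:"num"`
}
```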



@yrong
Contributor Author

yrong commented Aug 14, 2024

Why polling

The main reason to use polling rather than subscribing is, as mentioned here #1265 (comment), to avoid the race condition.

Say we've deployed 5 relayers and 1 is down. With subscribing, I would assume all the other relayers compete for the same message assigned to that failed relayer, and only one submission is accepted on chain (the others rejected by InvalidNonce). Considering the cost of submitV1 in the Gateway contract is 3 dollars, that's 3*(5-1-1)=9 dollars wasted, which we'd like to avoid.

Timeout Analysis

Take a specific transfer from production as an example, from the following logs:

`"message":"{\"@timestamp\":\"2024-08-12T16:04:27.679804405Z\",\"level\":\"info\",\"message\":\"Nonces are mismatched, scanning for commitments that need to be relayed\"}"`

`"message":"{\"@timestamp\":\"2024-08-12T16:05:16.249771481Z\",\"level\":\"info\",\"message\":\"Beefy Listener emitted new task\"}"`

`"message":"{\"@timestamp\":\"2024-08-12T16:05:42.844767874Z\",\"channelID\":\"0x0000000000000000000000000000000000000000000000000000000000000000\",\"level\":\"info\",\"message\":\"Message dispatched\",\"nonce\":12,\"success\":true}"`

As we can see, it takes about 50 seconds to scan for the message and 25 seconds for submitV1 to the Gateway and receiving the receipt, so about 75 seconds in total for a sending round to finish.

So we can wait 60 seconds for the task to time out (i.e. enough to cover the timeout for submitV1) and set the polling period to 90+rand(30) seconds (i.e. enough to cover the whole round).
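
As a rough sketch of what those numbers translate to in the listener loop (a hedged illustration, not this PR's exact code; doScan stands in for the existing scan routine):

```go
package relay

import (
	"context"
	"math/rand"
	"time"
)

// Timings derived from the analysis above; the names are illustrative.
const (
	taskTimeout = 60 * time.Second // wait before covering a nonce assigned to another relayer
	basePeriod  = 90 * time.Second // covers a full sending round
	maxJitter   = 30 * time.Second // randomness so relayers don't scan in lockstep
)

// pollLoop runs doScan periodically with a randomized period of 90+rand(30) seconds.
func pollLoop(ctx context.Context, doScan func(context.Context) error) error {
	for {
		if err := doScan(ctx); err != nil {
			return err
		}
		period := basePeriod + time.Duration(rand.Int63n(int64(maxJitter)))
		select {
		case <-time.After(period):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```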

@yrong
Contributor Author

yrong commented Aug 14, 2024

Tests with a setup of 3 parachain relayers

Configuration

Relayer 0:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-0.json --ethereum.private-key **

{
"id": 0,
"num": 3
}

Relayer 1:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-1.json --ethereum.private-key **

{
"id": 1,
"num": 3
}

Relayer 2:

./relayer/build/snowbridge-relay run parachain --config /tmp/snowbridge/parachain-relay-asset-hub-2.json --ethereum.private-key **

{
"id": 2,
"num": 3
}

Test Flow

  1. Run the smoke test first to register WETH with
cargo test --test register_token -- --nocapture
  2. Send token to AH with
cargo test --test send_token -- --nocapture
  3. Run transfer_token to send token to Ethereum for the first time (nonce: 1)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 1 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:03:00.889538+08:00","level":"info","message":"waiting for nonce 1 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:03:25.90941+08:00","level":"info","message":"nonce 1 picked up by another relayer, just skip"}
..
{"@timestamp":"2024-08-13T10:05:04.916829+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}

Relayer 1:

{"@timestamp":"2024-08-13T10:03:24.978745+08:00","level":"info","message":"nonce 1 self assigned to relay(1)"}
..
{"@timestamp":"2024-08-13T10:03:30.992641+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":1,"success":true}

Relayer 2:

{"@timestamp":"2024-08-13T10:03:36.398372+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":1}
  4. Run transfer_token to send token to Ethereum for the second time (nonce: 2)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:12:19.980181+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:45.000143+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}

Relayer 1:

{"@timestamp":"2024-08-13T10:12:19.037005+08:00","level":"info","message":"waiting for nonce 2 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:12:44.04805+08:00","level":"info","message":"nonce 2 picked up by another relayer, just skip"}

Relayer 2:

{"@timestamp":"2024-08-13T10:12:40.46347+08:00","level":"info","message":"nonce 2 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:12:46.474835+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":2,"success":true}
  5. Shut down Relayer 0, then run transfer_token to send token to Ethereum for the third time (nonce: 3)
cargo test --test transfer_token -- --nocapture

From the logs below, it's first spotted and then picked up by Relayer 1 after the timeout.

Relayer 1:

{"@timestamp":"2024-08-13T10:19:44.079279+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:20:49.099806+08:00","level":"info","message":"nonce 3 timeout but picked up by relay(1)"}

{"@timestamp":"2024-08-13T10:20:55.106054+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":3,"success":true}

Relayer 2:

{"@timestamp":"2024-08-13T10:19:53.521488+08:00","level":"info","message":"waiting for nonce 3 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:20:53.559145+08:00","level":"info","message":"nonce 3 picked up by another relayer, just skip"}
  6. Shut down Relayer 1, then run transfer_token to send token to Ethereum for the fourth time (nonce: 4)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 after the timeout.

Relayer 2:

{"@timestamp":"2024-08-13T10:31:18.578844+08:00","level":"info","message":"waiting for nonce 4 to be picked up by another relayer"}
...
{"@timestamp":"2024-08-13T10:32:23.613604+08:00","level":"info","message":"Beefy Listener emitted new task"}
{"@timestamp":"2024-08-13T10:32:23.613623+08:00","level":"info","message":"nonce 4 timeout but picked up by relay(2)"}

{"@timestamp":"2024-08-13T10:32:29.627741+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":4,"success":true}
  7. Start Relayer 0 and Relayer 1 again, then run transfer_token to send token to Ethereum for the fifth time (nonce: 5)
cargo test --test transfer_token -- --nocapture

From the logs below, it's picked up by Relayer 2 as expected.

Relayer 0:

{"@timestamp":"2024-08-13T10:40:42.545415+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:27.568863+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}

Relayer 1:

{"@timestamp":"2024-08-13T10:40:46.318362+08:00","level":"info","message":"waiting for nonce 5 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T10:41:26.345597+08:00","level":"info","message":"nonce 5 picked up by another relayer, just skip"}

Relayer 2:

{"@timestamp":"2024-08-13T10:41:22.655163+08:00","level":"info","message":"nonce 5 self assigned to relay(2)"}
...
{"@timestamp":"2024-08-13T10:41:28.662386+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":5,"success":true}
  8. Stop all relayers, run transfer_token to send tokens to Ethereum multiple times (nonces 6 to 9), wait for all messages to time out, and then start all relayers again.

From the logs below, the nonces are evenly distributed across the relayers.

Relayer 0 picks up nonce 6 and nonce 9:

{"@timestamp":"2024-08-13T10:56:02.711806+08:00","level":"info","message":"nonce 6 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T10:56:08.728365+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":6,"success":true}
...
{"@timestamp":"2024-08-13T11:10:28.110246+08:00","level":"info","message":"waiting for nonce 7 to be picked up by another relayer"}
{"@timestamp":"2024-08-13T11:10:33.112358+08:00","level":"info","message":"nonce 7 picked up by another relayer, just skip"}
...
{"@timestamp":"2024-08-13T11:12:04.115559+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
...
{"@timestamp":"2024-08-13T11:12:04.140457+08:00","level":"info","message":"nonce 9 self assigned to relay(0)"}
{"@timestamp":"2024-08-13T11:12:10.148195+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":9,"success":true}

Relayer 1 picks up nonce 7:

{"@timestamp":"2024-08-13T11:10:30.161278+08:00","level":"info","message":"nonce 7 self assigned to relay(1)"}
{"@timestamp":"2024-08-13T11:10:36.169644+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":7,"success":true}
...
{"@timestamp":"2024-08-13T11:12:01.170838+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce delivered to ethereum gateway","nonce":8}
{"@timestamp":"2024-08-13T11:12:01.171389+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}

Relayer 2 picks up nonce 8:

{"@timestamp":"2024-08-13T11:10:33.094379+08:00","level":"info","message":"nonce 8 self assigned to relay(2)"}
{"@timestamp":"2024-08-13T11:10:39.101708+08:00","channelID":"0x0000000000000000000000000000000000000000000000000000000000000000","level":"info","message":"Message dispatched","nonce":8,"success":true}
...
{"@timestamp":"2024-08-13T11:13:51.142968+08:00","channelID":[193,115,250,195,36,21,142,119,251,88,64,115,138,26,84,31,99,60,190,200,136,76,106,96,28,86,125,43,55,106,5,57],"level":"info","message":"Checked latest nonce generated by parachain outbound queue","nonce":9}

@yrong
Contributor Author

yrong commented Aug 14, 2024

Closed in favor of #1266.

@yrong yrong closed this Aug 14, 2024