Releases: orbs-network/orbs-network-go
Updated Ethereum Locking address
New locking contract - tokens are locked for 14 days.
Node sync fix & Election with locked stake
node sync fix
Fix node sync getting stuck during block consensus verification.
issue description
The node sync single goroutine hangs on block consensus validation, infinitely querying old, no-longer-available state, and cannot recover.
Example flow
Node sync uses a single goroutine.
Node sync runs in parallel to consensus.
Consensus is delayed on block height 'h' and node sync is triggered.
Node sync receives the block at height 'h' (from a node which already managed to commit it) and proceeds to validate it in audit mode, which includes block execution.
During this time consensus manages to progress and commits blocks 'h', ..., 'h + archive_state_support', advancing the node's state storage.
Node sync then runs block consensus validation, which keeps querying the old, no-longer-available state (at height 'h') forever, waiting for a success that never comes.
fix applied
Consensus validation during node sync now tries only once, falling back on node sync's built-in robustness to retry the entire sync flow starting from the node's current block state.
Busy polling on state to acquire the committee is now done only in LeanHelix, which holds a separate goroutine for responsiveness and progress.
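Below is a minimal Go sketch of the behavioral change (assumed names, not the actual orbs-network-go code): instead of busy-polling state storage until the committee for the synced height becomes available (which never happens once that height's state has been evicted), the sync flow validates once and surfaces the error, relying on the outer sync flow to restart from the node's current height.

```go
package main

import (
	"errors"
	"fmt"
)

var errStateNotAvailable = errors.New("state for requested height is no longer available")

// validateBlockConsensus stands in for audit-mode consensus validation, which
// needs the state at the synced height. Once state storage has advanced past
// that height, the query can never succeed.
func validateBlockConsensus(height, oldestAvailableHeight uint64) error {
	if height < oldestAvailableHeight {
		return errStateNotAvailable
	}
	return nil
}

// syncValidateOnce is the post-fix behavior: a single attempt, no busy polling.
// On failure, the outer node sync flow restarts from the node's current blocks state.
func syncValidateOnce(height, oldestAvailableHeight uint64) error {
	return validateBlockConsensus(height, oldestAvailableHeight)
}

func main() {
	// Synced height 100 is older than the oldest state the node still keeps (105),
	// so validation fails fast instead of retrying forever.
	if err := syncValidateOnce(100, 105); err != nil {
		fmt.Println("sync attempt failed, restarting sync from current height:", err)
	}
}
```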
Election with locked stake
Allow locked ORBS stake to be counted during elections.
Fix node sync infinite committee polling
node sync fix
Fix node sync getting stuck during block consensus verification.
issue description
The node sync single goroutine hangs on block consensus validation, infinitely querying old, no-longer-available state, and cannot recover.
Example flow
Node sync uses a single goroutine.
Node sync runs in parallel to consensus.
Consensus is delayed on block height 'h' and node sync is triggered.
Node sync receives the block at height 'h' (from a node which already managed to commit it) and proceeds to validate it in audit mode, which includes block execution.
During this time consensus manages to progress and commits blocks 'h', ..., 'h + archive_state_support', advancing the node's state storage.
Node sync then runs block consensus validation, which keeps querying the old, no-longer-available state (at height 'h') forever, waiting for a success that never comes.
fix applied
Consensus validation during node sync now tries only once, falling back on node sync's built-in robustness to retry the entire sync flow starting from the node's current block state.
Busy polling on state to acquire the committee is now done only in LeanHelix, which holds a separate goroutine for responsiveness and progress.
Reputation is always used for committee ordering
Move committee generation into an SDK and use only the committee contract for ordering with reputation (this also reduces code duplication).
Reputation calculation change
Reputation activation and algorithm change.
We are preparing our codebase for POSv2 by moving in small increments whenever possible.
In this blog post we describe the improvements to the way the order of a random committee is generated and the corresponding changes to the reputation mechanism.
Terms:
Close a block - the leader node suggests a block and that block reaches consensus in the network.
Misses - the number of consecutive times a node did not manage to close a block as a leader. This number is cleared back to 0 (= perfect) the next time the node closes a block.
Current code (pre v1.3.9):
We use a config setting (consensus-context-committee-using-contract) to activate the committee contract, which both keeps track of misses and generates the random committee.
The problem:
The configuration is set to false by default, so we don’t actually use the contract. Also, we have two copies of very similar randomization code and three copies of very similar code to internally call a contract method.
Reputation v1 (Algorithm)
Misses: keep track of the number of times the leader didn’t close its suggested block.
Reputation: normalize the misses count. The first three misses don’t count (reputation stays 0); between 4 and 10 misses the reputation equals the number of misses; the reputation is capped at 10.
Algorithm: give each node a random score (a number between 1 and 2^32); if the reputation is not 0, divide that score by 2^reputation. Sort the nodes by score from high to low; this is the order of the committee.
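A minimal Go sketch of this v1 ordering (assumed names, not the project's actual code):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

type node struct {
	name       string
	reputation uint // 0 (perfect) .. 10 (capped)
	score      uint64
}

// orderCommitteeV1 gives each node a random score in [1, 2^32], divides it by
// 2^reputation when the reputation is not 0, and sorts from high to low.
func orderCommitteeV1(nodes []node, rnd *rand.Rand) []node {
	for i := range nodes {
		score := uint64(rnd.Int63n(1<<32)) + 1
		if nodes[i].reputation > 0 {
			score >>= nodes[i].reputation // divide by 2^reputation
		}
		nodes[i].score = score
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].score > nodes[j].score })
	return nodes
}

func main() {
	rnd := rand.New(rand.NewSource(42))
	committee := orderCommitteeV1([]node{
		{name: "a", reputation: 0},
		{name: "b", reputation: 5},
		{name: "c", reputation: 10},
	}, rnd)
	for _, n := range committee {
		fmt.Println(n.name, n.score)
	}
}
```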
The problem:
The reputation doesn’t work as planned. The idea of reputation v1 was that a node with reputation 10 should get a 1 in 1024 chance to close a block (thought to be roughly translatable to once every 3-4 hours). In practice the algorithm gives that leader node a 1 in 1024^(number of nodes - 1) chance, which translates to weeks or even months. Even when a few nodes have a reputation of 5, their chance to close a block and go back to reputation 0 is extremely low.
Timing Configuration
To keep the liveness of the network, if the block is empty (common case at the moment), we delay the first leader so that it always fails to close the block.
The problem:
If reputation worked properly, the configuration to delay an empty block would be superfluous. Having that configuration actually “pushes” nodes to the bad list.
Reputation v2 (orbs v1.3.9):
General changes:
- Timing configuration is removed.
- Work towards turning the committee-contract configuration on (true) by default. Assume the input (list of validators) is ordered under consensus (automatically true for elections; a simple sort was added for the genesis list).
- Change the reputation value to 0 when there are two misses or fewer (2 = tolerance), 1-6 for miss counts of 3-8, and cap it at 6 (see the sketch below). Each reputation point is approximately a factor-2 reduction in the chance to be leader (locked by tests).
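A small Go sketch (assumed names) of this misses-to-reputation normalization:

```go
package main

import "fmt"

const (
	missTolerance = 2 // up to two misses keep a perfect reputation
	maxReputation = 6 // reputation is capped at 6
)

// reputationFromMisses maps a consecutive-miss count to a reputation value:
// 0 for 0-2 misses, 1-6 for 3-8 misses, and 6 for anything above that.
func reputationFromMisses(misses uint32) uint32 {
	if misses <= missTolerance {
		return 0
	}
	if misses-missTolerance > maxReputation {
		return maxReputation
	}
	return misses - missTolerance
}

func main() {
	for _, m := range []uint32{0, 2, 3, 8, 20} {
		fmt.Printf("misses=%d -> reputation=%d\n", m, reputationFromMisses(m))
	}
}
```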
Reputation v2 (Algorithm)
To order the list of validators, we calculate each node’s absolute weight as 2^(cap of reputation - actual reputation). This is a power-of-2 value between 64 (perfect reputation of 0) and 1 (worst reputation of 6). We use this absolute weight to perform a weighted probability draw without repeats. This means the chance of being leader (when there are nodes with non-perfect values) is not absolute but reflects the size of the committee and the reputation values of all nodes.
This closely approximates a factor-2 penalty per reputation point during ordering under the more likely scenarios, such as:
- All nodes are closing blocks (all reputation values are 0).
- Only one node is down (soon to be voted out) and it has a reputation value of 6.
- A few nodes had short down times and have reputation 1-2 and are waiting for a chance to close a block and return to reputation 0.
Examples:
- Committee size 20, all nodes close blocks successfully: each node has a chance of 1 in 20 to close a block.
- Committee size 20, one node with reputation 6: that node has a chance of 1 in 1217 (19 * 64 + 1) to be leader, while each of the other nodes has a chance of about 1 in 19 (64 / (19 * 64 + 1)).
- Committee size 20, where 5 nodes have a reputation of 2: each of those has a chance of 1 in 65 (16 / (15 * 64 + 5 * 16)) to be leader vs. 1 in 16 (64 / (15 * 64 + 5 * 16)) for the nodes with perfect reputation.
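The arithmetic in these examples can be reproduced with a short Go sketch of the weighted draw without repeats described above (assumed names, not the project's actual code):

```go
package main

import (
	"fmt"
	"math/rand"
)

const maxReputation = 6

// weight returns the absolute weight 2^(maxReputation - reputation):
// 64 for a perfect node (reputation 0), 1 for the worst (reputation 6).
func weight(reputation uint32) uint64 {
	return 1 << (maxReputation - reputation)
}

// orderCommitteeV2 draws node indices one by one with probability
// proportional to their weight, without repeats.
func orderCommitteeV2(reputations []uint32, rnd *rand.Rand) []int {
	remaining := make([]int, len(reputations))
	for i := range remaining {
		remaining[i] = i
	}
	order := make([]int, 0, len(reputations))
	for len(remaining) > 0 {
		var total uint64
		for _, idx := range remaining {
			total += weight(reputations[idx])
		}
		draw := uint64(rnd.Int63n(int64(total)))
		for pos, idx := range remaining {
			w := weight(reputations[idx])
			if draw < w {
				order = append(order, idx)
				remaining = append(remaining[:pos], remaining[pos+1:]...)
				break
			}
			draw -= w
		}
	}
	return order
}

func main() {
	// Committee of 20 where one node has reputation 6 (second example above).
	reputations := make([]uint32, 20)
	reputations[0] = 6
	var total uint64
	for _, r := range reputations {
		total += weight(r) // 19*64 + 1 = 1217
	}
	fmt.Printf("bad node leader chance: 1 in %d\n", total/weight(6))    // 1 in 1217
	fmt.Printf("good node leader chance: ~1 in %d\n", total/weight(0))  // ~1 in 19

	rnd := rand.New(rand.NewSource(42))
	fmt.Println("one possible committee order:", orderCommitteeV2(reputations, rnd))
}
```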
Improve memory usage + fix to allow running reputation
v1.3.7
Gossip multicast bugfix
Includes all changes from v1.3.3, and in addition:
#1504, which fixed a bug where the TCP transport failed a multicast message if it encountered a recipient it was not familiar with.
Elections delegation fix
v1.3.5, attempting to fix the CI for tags.
Elections delegation fix
This is a hotfix release fixing (#1502) a bug introduced in v1.3.2 due to a change in how go-ethereum parses events.
The _Elections Orbs contract mirrors delegation data from Ethereum. There are two methods of expressing delegations: a delegator can either transfer 0.07 ORBS to the delegate, or she can explicitly call OrbsVoting.delegate on Ethereum.
In the Orbs contract, events are read from Ethereum and applied to the Orbs state so that Orbs reflects the state changes in Ethereum. The event is parsed into a Golang struct, which was missing the delegationCount field. v1.3.2 upgraded go-ethereum from v1.8.18, which was lenient towards missing fields when parsing events into structs, to v1.9.6, which isn't. As a result, all explicit delegation events emitted after November 24th, which is when v1.3.2 was deployed to mainnet, have not been mirrored.
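A hypothetical Go sketch of the kind of struct the mirroring code unpacks the delegation event into (field names here are illustrative, not the actual contract bindings). The essence of the fix was adding the missing counter field, since go-ethereum v1.9.6 no longer tolerates a struct that lacks a field for one of the event's arguments:

```go
package main

import (
	"fmt"
	"math/big"
)

// delegateEvent is an illustrative stand-in for the struct the explicit
// delegation event is parsed into. Before the fix the counter field was
// missing; v1.8.18 tolerated that when unpacking, v1.9.6 rejects it.
type delegateEvent struct {
	Delegator         [20]byte // delegating address
	To                [20]byte // delegate address
	DelegationCounter *big.Int // the field that was missing before this fix
}

func main() {
	e := delegateEvent{DelegationCounter: big.NewInt(1)}
	fmt.Println("delegation counter:", e.DelegationCounter)
}
```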
Other changes included in this release:
- #1462 - prevents Gossip from broadcasting messages when multicast was requested. This fix will prevent view change messages from being delivered to all nodes rather than only to the leader.
- #1475 - upgrades Scribe (logger library) to v0.2.3
- #1477, #1478, #1480 #1487 - various logging changes to help troubleshoot production issues