
feat: single shard tracking State cleanup #12734

Open
wants to merge 6 commits into master
Conversation


@staffik staffik commented Jan 14, 2025

Clean up the parent shard's State if neither child was tracked for GC-window epochs.
This also implements shard garbage collection in general; see #11883.

TODO: add more tests

Summary

  • We clean up unused shards' State with a delay of GC-window epochs. One reason is that GC itself modifies the State, so removing the State earlier would result in negative refcounts at the very least, if not more serious problems.
  • For that, we need to know whether a shard has not been tracked for GC-window epochs. One caveat is that the validator operator could potentially have changed the validator key in this period, so we should not rely on the current validator key (or even the tracking config) to tell which shards were tracked in the past.
  • We use the TrieChanges column to determine which shards were tracked at a given epoch. We rely on TrieChanges being saved for the last block of an epoch, for all shards that were tracked in that epoch. TODO: add a test that focuses on that.
  • The cleanup is only triggered once we have gc-ed the last block of an epoch, which is always a final block on the canonical chain.
  • For each shard that we clean up, we also remove its State mapping, since deleting a shard's State means we no longer have the State for any of its descendant shards either.
  • Of course, we should not remove the State of shards that are currently tracked, and we do not remove the State of shards that we care about in the next epoch.
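
The decision rule described by the bullets above can be sketched roughly as follows. This is a hypothetical simplification, not the actual PR code: the types and function names (`shards_to_cleanup`, the three input sets) are illustrative, and in the real implementation the tracked sets are derived from the TrieChanges column and the shard tracker.

```rust
use std::collections::HashSet;

// Illustrative stand-in for nearcore's ShardUId.
type ShardUId = u64;

/// Sketch of the cleanup rule: a shard's State may be removed only if it was
/// tracked in the gc-ed epoch, has not been tracked in any later epoch, and we
/// do not care about it in the current or next epoch.
fn shards_to_cleanup(
    tracked_in_gced_epoch: &HashSet<ShardUId>,
    tracked_in_later_epochs: &HashSet<ShardUId>,
    cared_about_this_or_next_epoch: &HashSet<ShardUId>,
) -> HashSet<ShardUId> {
    tracked_in_gced_epoch
        // Candidates: tracked back then, but not in any later epoch.
        .difference(tracked_in_later_epochs)
        // Never remove State we still need now or next epoch.
        .filter(|shard| !cared_about_this_or_next_epoch.contains(*shard))
        .copied()
        .collect()
}
```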

Testing

GC num epochs to keep is set to 3.

Notation

P - parent shard
C - child shard
U - unrelated shard
Schedule: (epoch before resharding) | (epoch after resharding) | ... next epochs

Tested scenarios

  • P | C | U ... test_resharding_v3_state_cleanup
  • P | U ... test_resharding_v3_do_not_track_children_after_resharding
  • P | C | U | U | U | U | U | C ... test_resharding_v3_stop_track_child_for_5_epochs (in the end we do not map to parent)
  • P | C1 | U | U | C2 | U | U | C1 ... test_resharding_v3_stop_track_child_for_5_epochs_with_sibling_in_between (in the end we map to parent)
  • P | U | C ... test_resharding_v3_shard_shuffling_untrack_then_track
  • U | U | C ... test_resharding_v3_sync_child
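
Under the schedule notation above, the cleanup condition can be simulated with a small helper. This is a hypothetical sketch (the function and its signature are not part of the PR): it assumes the parent's State becomes eligible for cleanup once neither the parent nor a child has been tracked for `gc_window` consecutive epochs, and ignores real-world timing details such as when GC actually reaches the last block of an epoch.

```rust
/// Given which shard the node tracks in each epoch ('P' parent, 'C' child,
/// 'U' unrelated) and the GC window, return the first epoch index at which
/// the parent's State could be cleaned up, or None if it never can.
fn first_cleanup_epoch(schedule: &[char], gc_window: usize) -> Option<usize> {
    let mut untracked_streak = 0;
    for (epoch, &shard) in schedule.iter().enumerate() {
        if shard == 'P' || shard == 'C' {
            // Tracking the parent or a child resets the streak.
            untracked_streak = 0;
        } else {
            untracked_streak += 1;
            if untracked_streak == gc_window {
                return Some(epoch);
            }
        }
    }
    None
}
```

For example, under this model the `P | C | U | U | U | U | U | C` schedule with a GC window of 3 cleans up the parent during the five untracked epochs, which is why the returning child no longer maps to the parent in that scenario.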


codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 92.64706% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.81%. Comparing base (483a0e8) to head (6b191d9).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/garbage_collection.rs 89.70% 0 Missing and 14 partials ⚠️
core/store/src/adapter/flat_store.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #12734       +/-   ##
===========================================
+ Coverage    1.65%   70.81%   +69.15%     
===========================================
  Files         670      848      +178     
  Lines      120798   174365    +53567     
  Branches   120798   174365    +53567     
===========================================
+ Hits         2001   123476   +121475     
+ Misses     118694    45734    -72960     
- Partials      103     5155     +5052     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (?)
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.35% <0.00%> (-0.01%) ⬇️
linux 69.26% <78.43%> (+67.60%) ⬆️
linux-nightly 70.37% <92.64%> (?)
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.46% <0.00%> (-0.01%) ⬇️
unittests 70.64% <92.64%> (?)
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@staffik staffik marked this pull request as ready for review January 15, 2025 13:23
@staffik staffik requested a review from a team as a code owner January 15, 2025 13:23
@wacban wacban left a comment

I didn't finish the review but so far looks good. I'll let others approve or come back to it tomorrow.

Comment on lines +241 to +252
let tracked_shards_in_gced_epoch_to_check_for_cleanup = if !shard_tracker
.tracks_all_shards()
&& epoch_manager.is_last_block_in_finished_epoch(block_hash)?
{
Some(get_tracked_shards_in_past_epoch(
&chain_store_update,
&epoch_manager,
block_hash,
)?)
} else {
None
};

nit: Maybe move to a helper method?

Comment on lines +261 to +262
if let Some(potential_shards_for_cleanup) =
tracked_shards_in_gced_epoch_to_check_for_cleanup

nit: Keep consistent naming e.g. call both potential_shards_for_cleanup.

Comment on lines +1092 to +1096
fn get_tracked_shards_in_past_epoch(
chain_store_update: &ChainStoreUpdate,
epoch_manager: &Arc<dyn EpochManagerAdapter>,
past_epoch_block_hash: &CryptoHash,
) -> Result<Vec<ShardUId>, Error> {

I don't think this method relies on the fact that the block is in a past epoch or that it is the last block of that epoch. I would suggest removing past_epoch from the method and argument names.

Comment on lines +1114 to +1125
/// State cleanup for single shard tracking. Removes State of shards that are no longer in use.
///
/// It has to be run after we clear block data for the `last_block_hash_in_gced_epoch`.
/// `tracked_shards_in_gced_epoch` are shards that were tracked in the gc-ed epoch,
/// and these are shards that we potentially no longer use and that can be cleaned up.
/// We do not clean up a shard if it has been tracked in any epoch later,
/// or we care about it in the current or the next epoch (relative to Head).
///
/// With ReshardingV3, we use State mapping (see DBCol::StateShardUIdMapping),
/// where each `ShardUId` is potentially mapped to its ancestor to get the database key prefix.
/// We only remove a shard State if all its descendants are ready to be cleaned up,
/// in which case, we also remove the mapping from `StateShardUIdMapping`.

❤️

@wacban wacban left a comment

LGTM, great stuff!

@Trisfald Trisfald left a comment

🚀

return Err(Error::GCError(
"block on canonical chain shouldn't have refcount 0".into(),
));
}
debug_assert_eq!(blocks_current_height.len(), 1);

// Do not clean up immediatelly, as we still need the State to run gc for this block.

Suggested change
// Do not clean up immediatelly, as we still need the State to run gc for this block.
// Do not clean up immediately, as we still need the State to run gc for this block.

@@ -13,6 +15,18 @@ pub fn shuffle_receipt_proofs<ReceiptProofType>(
receipt_proofs.shuffle(&mut rng);
}

pub fn cares_about_shard_this_or_next_epoch(

nit: As I read this function, it might as well be a member of shard tracker.
Feel free to ignore

/// Maximum number of epochs under which the test should finish.
const TESTLOOP_NUM_EPOCHS_TO_WAIT: u64 = 8;
/// Default number of epochs for resharding testloop to run.
// TODO(resharding) Fix nearcore and set it to 10.

Do you have any insight on why it fails with 10? If you do it might be good to capture that in a comment or issue.

If you don't have any relevant clue, feel free to ignore this comment
