Add dynamodb retry config for throttling and other errors. Add exponential backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail #1023

Open: KaspariK wants to merge 5 commits into master from u/kkasp/TRON-2342-exponential-backoff-dynamo-get

Conversation

@KaspariK (Member) commented on Jan 15, 2025:

  1. Use Boto3 retries (see Retries - Boto3); a config sketch follows below
  2. Backoff on getting unprocessed keys
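For reference, a minimal sketch of what wiring up botocore's built-in retry handling can look like; the mode, attempt count, and table name below are illustrative placeholders, not necessarily what this PR ships:

    import boto3
    from botocore.config import Config

    # "standard" retry mode backs off exponentially (with jitter) on throttling and
    # transient errors; see the Boto3 "Retries" guide above for the exact semantics.
    retry_config = Config(
        retries={
            "max_attempts": 5,
            "mode": "standard",
        }
    )

    dynamodb = boto3.resource("dynamodb", config=retry_config)
    table = dynamodb.Table("tron-state")  # placeholder table name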

…ntial backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail
@KaspariK KaspariK force-pushed the u/kkasp/TRON-2342-exponential-backoff-dynamo-get branch from b361235 to 3e74d75 on January 20, 2025 15:43
log.warning(
f"Attempt {attempts}/{MAX_UNPROCESSED_KEYS_RETRIES} - Retrying {len(cand_keys_list)} unprocessed keys after {delay:.2f}s delay."
)
time.sleep(delay)
Member Author:
What to do about this lil guy?

Member Author:
!8ball we should use a restore thread

Member:
yea, we should probably try to figure out a non-blocking way to do this or have this run in a separate thread - if we get to the worst case of 5 attempts and this is running on the reactor thread, we'll essentially block all of tron from doing anything for 20s
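Purely for illustration, the separate-thread option could take roughly this shape, assuming Twisted's thread pool is available (the reactor is mentioned above); restore_blocking is a hypothetical stand-in for the existing blocking restore path, not code from this PR:

    from twisted.internet import threads

    def restore_blocking(keys):
        # stand-in for the current restore logic that retries and time.sleep()s
        return {key: None for key in keys}

    def restore_off_reactor(keys):
        # deferToThread runs restore_blocking in a worker thread and returns a
        # Deferred, so the reactor thread is not stuck for up to ~20s of sleeps
        return threads.deferToThread(restore_blocking, keys)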

Member:
although, actually - this is probably fine since we do all sorts of blocking stuff in restore and aren't expecting tron to be usable/do anything until we've restored everything

...so maybe this is fine?

@KaspariK KaspariK marked this pull request as ready for review January 21, 2025 16:39
@KaspariK KaspariK requested a review from a team as a code owner January 21, 2025 16:39
@@ -294,7 +296,8 @@ def test_delete_item_with_json_partitions(self, store, small_object, large_objec
vals = store.restore([key])
assert key not in vals

def test_retry_saving(self, store, small_object, large_object):
@mock.patch("time.sleep", return_value=None)
Member:
my personal preference is usually to use the context manager way of mocking since that gives a little more control over where a mock is active, but not a blocker :)
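For illustration, the context-manager style being suggested would look roughly like this (the test body is a placeholder, not the actual assertions):

    from unittest import mock

    def test_retry_saving(store, small_object, large_object):
        # the patch is only active inside the with block, so the rest of the
        # test (and anything it calls afterwards) sees the real time.sleep
        with mock.patch("time.sleep", return_value=None) as mock_sleep:
            ...  # exercise the retry path here and assert against mock_sleep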

@@ -8,7 +8,9 @@
from moto.dynamodb2.responses import dynamo_json_dump

from testifycompat import assert_equal
from testifycompat.assertions import assert_in
Member:
we should use the native pytest/python assertions in new code - testifycompat shouldn't really be used for new code (it's a compatibility layer to aid in migrating from our old testing framework to pytest)

Member:
(and related, I'd probably replace the new assert_equal calls with assert X == Y)
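That is, plain assertions that pytest can introspect and report on failure, e.g. with throwaway values:

    # instead of assert_equal(...) / assert_in(...) from testifycompat
    actual_delays = [0.5, 1.0, 2.0]
    expected_delays = [0.5, 1.0, 2.0]
    assert actual_delays == expected_delays
    assert 1.0 in actual_delays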

Comment on lines +337 to +340
def side_effect_random_uniform(a, b):
    return b

mock_random_uniform.side_effect = side_effect_random_uniform
Member:
i think

Suggested change:
-    def side_effect_random_uniform(a, b):
-        return b
-    mock_random_uniform.side_effect = side_effect_random_uniform
+    mock_random_uniform.side_effect = lambda a, b: b

would work too - but either way is fine (having a nested function definition in a test usually looks a little off to me since it's sometimes a sign that said function could be a mock and a lambda makes it a lot clearer that this is a throwaway)

Comment on lines +347 to +349
with pytest.raises(Exception) as exec_info, mock.patch(
"tron.config.static_config.load_yaml_file", autospec=True
), mock.patch("tron.config.static_config.build_configuration_watcher", autospec=True):
Member:
could these be merged with the outer context manager?
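Merged, that could look roughly like the sketch below; the test name is hypothetical, and the exact set of outer patches may differ from what the test really uses:

    import pytest
    from unittest import mock

    def test_restore_gives_up_after_max_retries(store):
        # all the patches plus the raises() check can share one with statement
        # instead of nesting inside an outer context manager
        with mock.patch("time.sleep", return_value=None), mock.patch(
            "tron.config.static_config.load_yaml_file", autospec=True
        ), mock.patch(
            "tron.config.static_config.build_configuration_watcher", autospec=True
        ), pytest.raises(Exception) as exec_info:
            store.restore(["some_key"])  # expected to raise once retries are exhausted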

exponential_delay = min(base_delay_seconds * (2 ** (attempts - 1)), max_delay_seconds)
# Full jitter (i.e. from 0 to exponential_delay) will help minimize the number and length of calls
jitter = random.uniform(0, exponential_delay)
delay = jitter
Member:
did we mean to add the exponential delay + the jitter? or is waiting for a random time between 0 and the expected delay what we wanted? (i.e., is waiting 0 seconds fine?)

Member:
oh, i see - boto is doing its own exponential backoff?

imo, we can likely skip adding any jitter - each dynamodb table only has a single reader/writer (tron) so there's not much of a risk of a thundering herd scenario :)

Comment on lines +354 to +362
# We also need to verify that sleep was called with expected delays
expected_delays = []
base_delay_seconds = 0.5
max_delay_seconds = 10
for attempt in range(1, MAX_UNPROCESSED_KEYS_RETRIES + 1):
    expected_delay = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
    expected_delays.append(expected_delay)
actual_delays = [call.args[0] for call in mock_sleep.call_args_list]
assert_equal(actual_delays, expected_delays)
Member:
i'd maybe extract the exponential backoff logic in tron/serialize/runstate/dynamodb_state_store.py to a function so that we can write a more targeted test for that and simplify this to checking if we called that function the right amount of times

(mostly 'cause I generally try to avoid for loops/calculations inside tests :p)
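A possible shape for that extraction (the helper name is hypothetical; the math mirrors the snippet above):

    import random

    def unprocessed_keys_backoff_delay(attempt, base_delay_seconds=0.5, max_delay_seconds=10):
        # exponential backoff capped at max_delay_seconds, with full jitter
        # (a uniform draw between 0 and the capped exponential delay)
        exponential_delay = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
        return random.uniform(0, exponential_delay)

The test could then patch this one helper and assert it was called once per retry with the expected attempt number, instead of recomputing the delays in a loop.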

f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{failed_keys}\n from dynamodb\n{resp.result()}"
)
raise error
result = resp.result()
Contributor:
I wonder if we should also print the response when we get into the exception block, so we have an idea of why we got unprocessed keys and why we exceeded the attempts

Contributor:
so maybe we add it here

                except Exception as e:
                    log.exception("Encountered issues retrieving data from DynamoDB")
                    raise e
