Add dynamodb retry config for throttling and other errors. Add exponential backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail #1023

Open: KaspariK wants to merge 5 commits into master from u/kkasp/TRON-2342-exponential-backoff-dynamo-get

Conversation

@KaspariK (Member) commented on Jan 15, 2025:

  1. Use Boto3 retries (see Retries - Boto3); a config sketch follows below
  2. Backoff on getting unprocessed keys
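For reference, a minimal sketch of what wiring up botocore's built-in retry handling can look like; the mode, attempt count, and table name below are illustrative placeholders, not necessarily what this PR ships:

    import boto3
    from botocore.config import Config

    # "standard" retry mode backs off exponentially (with jitter) on throttling and
    # transient errors; see the Boto3 "Retries" guide above for the exact semantics.
    retry_config = Config(
        retries={
            "max_attempts": 5,
            "mode": "standard",
        }
    )

    dynamodb = boto3.resource("dynamodb", config=retry_config)
    table = dynamodb.Table("tron-state")  # placeholder table name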

…ntial backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail
@KaspariK KaspariK force-pushed the u/kkasp/TRON-2342-exponential-backoff-dynamo-get branch from b361235 to 3e74d75 on January 20, 2025 15:43
log.warning(
f"Attempt {attempts}/{MAX_UNPROCESSED_KEYS_RETRIES} - Retrying {len(cand_keys_list)} unprocessed keys after {delay:.2f}s delay."
)
time.sleep(delay)
Member Author:
What to do about this lil guy?

Member Author:
!8ball we should use a restore thread

Member:
yea, we should probably try to figure out a non-blocking way to do this or have this run in a separate thread - if we get to the worst case of 5 attempts and this is running on the reactor thread, we'll essentially block all of tron from doing anything for 20s
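Purely for illustration, the separate-thread option could take roughly this shape, assuming Twisted's thread pool is available (the reactor is mentioned above); restore_blocking is a hypothetical stand-in for the existing blocking restore path, not code from this PR:

    from twisted.internet import threads

    def restore_blocking(keys):
        # stand-in for the current restore logic that retries and time.sleep()s
        return {key: None for key in keys}

    def restore_off_reactor(keys):
        # deferToThread runs restore_blocking in a worker thread and returns a
        # Deferred, so the reactor thread is not stuck for up to ~20s of sleeps
        return threads.deferToThread(restore_blocking, keys)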

Member:
although, actually - this is probably fine since we do all sorts of blocking stuff in restore and aren't expecting tron to be usable/do anything until we've restored everything

...so maybe this is fine?

@KaspariK KaspariK marked this pull request as ready for review January 21, 2025 16:39
@KaspariK KaspariK requested a review from a team as a code owner January 21, 2025 16:39
@@ -294,7 +296,8 @@ def test_delete_item_with_json_partitions(self, store, small_object, large_objec
vals = store.restore([key])
assert key not in vals

def test_retry_saving(self, store, small_object, large_object):
@mock.patch("time.sleep", return_value=None)
Member:
my personal preference is usually to use the context manager way of mocking since that gives a little more control over where a mock is active, but not a blocker :)
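For illustration, the context-manager style being suggested would look roughly like this (the test body is a placeholder, not the actual assertions):

    from unittest import mock

    def test_retry_saving(store, small_object, large_object):
        # the patch is only active inside the with block, so the rest of the
        # test (and anything it calls afterwards) sees the real time.sleep
        with mock.patch("time.sleep", return_value=None) as mock_sleep:
            ...  # exercise the retry path here and assert against mock_sleep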

@@ -8,7 +8,9 @@
from moto.dynamodb2.responses import dynamo_json_dump

from testifycompat import assert_equal
from testifycompat.assertions import assert_in
Member:
we should use the native pytest/python assertions in new code - testifycompat shouldn't really be used for new code (it's a compatibility layer to aid in migrating from our old testing framework to pytest)

Member:
(and related, I'd probably replace the new assert_equal calls with assert X == Y)
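That is, plain assertions that pytest can introspect and report on failure, e.g. with throwaway values:

    # instead of assert_equal(...) / assert_in(...) from testifycompat
    actual_delays = [0.5, 1.0, 2.0]
    expected_delays = [0.5, 1.0, 2.0]
    assert actual_delays == expected_delays
    assert 1.0 in actual_delays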

Comment on lines +337 to +340
def side_effect_random_uniform(a, b):
    return b

mock_random_uniform.side_effect = side_effect_random_uniform
Member:
i think

Suggested change:
-    def side_effect_random_uniform(a, b):
-        return b
-    mock_random_uniform.side_effect = side_effect_random_uniform
+    mock_random_uniform.side_effect = lambda a, b: b

would work too - but either way is fine (having a nested function definition in a test usually looks a little off to me since it's sometimes a sign that said function could be a mock and a lambda makes it a lot clearer that this is a throwaway)

Comment on lines +347 to +349
with pytest.raises(Exception) as exec_info, mock.patch(
"tron.config.static_config.load_yaml_file", autospec=True
), mock.patch("tron.config.static_config.build_configuration_watcher", autospec=True):
Member:
could these be merged with the outer context manager?
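Merged, that could look roughly like the sketch below; the test name is hypothetical, and the exact set of outer patches may differ from what the test really uses:

    import pytest
    from unittest import mock

    def test_restore_gives_up_after_max_retries(store):
        # all the patches plus the raises() check can share one with statement
        # instead of nesting inside an outer context manager
        with mock.patch("time.sleep", return_value=None), mock.patch(
            "tron.config.static_config.load_yaml_file", autospec=True
        ), mock.patch(
            "tron.config.static_config.build_configuration_watcher", autospec=True
        ), pytest.raises(Exception) as exec_info:
            store.restore(["some_key"])  # expected to raise once retries are exhausted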

exponential_delay = min(base_delay_seconds * (2 ** (attempts - 1)), max_delay_seconds)
# Full jitter (i.e. from 0 to exponential_delay) will help minimize the number and length of calls
jitter = random.uniform(0, exponential_delay)
delay = jitter
Member:
did we mean to add the exponential delay + the jitter? or is waiting for a random time between 0 and the expected delay what we wanted? (i.e., is waiting 0 seconds fine?)

Member:
oh, i see - boto is doing its own exponential backoff?

imo, we can likely skip adding any jitter - each dynamodb table only has a single reader/writer (tron) so there's not much of a risk of a thundering herd scenario :)

Comment on lines +354 to +362
# We also need to verify that sleep was called with expected delays
expected_delays = []
base_delay_seconds = 0.5
max_delay_seconds = 10
for attempt in range(1, MAX_UNPROCESSED_KEYS_RETRIES + 1):
    expected_delay = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
    expected_delays.append(expected_delay)
actual_delays = [call.args[0] for call in mock_sleep.call_args_list]
assert_equal(actual_delays, expected_delays)
Member:
i'd maybe extract the exponential backoff logic in tron/serialize/runstate/dynamodb_state_store.py to a function so that we can write a more targeted test for that and simplify this to checking if we called that function the right amount of times

(mostly 'cause I generally try to avoid for loops/calculations inside tests :p)
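A possible shape for that extraction (the helper name is hypothetical; the math mirrors the snippet above):

    import random

    def unprocessed_keys_backoff_delay(attempt, base_delay_seconds=0.5, max_delay_seconds=10):
        # exponential backoff capped at max_delay_seconds, with full jitter
        # (a uniform draw between 0 and the capped exponential delay)
        exponential_delay = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
        return random.uniform(0, exponential_delay)

The test could then patch this one helper and assert it was called once per retry with the expected attempt number, instead of recomputing the delays in a loop.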

f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{failed_keys}\n from dynamodb\n{resp.result()}"
)
raise error
result = resp.result()
Contributor:
I wonder if we should also print the response when we get into the exception block, so we have an idea of why we got unprocessed keys and why we exceeded the attempts

Contributor:
so maybe we add it here

                except Exception as e:
                    log.exception("Encountered issues retrieving data from DynamoDB")
                    raise e
