
Refactor to prepare for parallel sampling #100

Merged (32 commits into octoml:batch-serving on Dec 11, 2023)

Conversation

masahi (Member) commented Dec 7, 2023

The main changes are

  • Change various data structures to associate a request with multiple generated sequences
  • Extract common bits in sync and staging engines into engine_common.py

There is no functional change, but since the change is big, I request a careful review.

@sunggg @elvin-n

# entries for the older tokens.
if (
    len(self.current_batch) == 0
    and num_tokens > self.max_num_batched_tokens
masahi (Member, Author):

Previously this line was

num_new_batched_tokens > self.max_num_batched_tokens

but I believe the lhs should be num_tokens. Can you confirm? @elvin-n
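
For context, a minimal standalone sketch of the guard under discussion, using the names visible in the excerpt above (current_batch, num_tokens, max_num_batched_tokens); the helper function itself is hypothetical, not code from this PR:

def can_ever_schedule(
    current_batch_size: int, num_tokens: int, max_num_batched_tokens: int
) -> bool:
    # A request whose own token count already exceeds the per-batch budget
    # while the batch is empty can never be scheduled, so the engine has to
    # reject it up front. This is why the comparison should use num_tokens
    # (the request's total token count), not the running batch counter.
    return not (current_batch_size == 0 and num_tokens > max_num_batched_tokens)

# Example: a 5000-token request against a 4096-token budget and an empty batch.
assert not can_ever_schedule(current_batch_size=0, num_tokens=5000,
                             max_num_batched_tokens=4096)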

    output_text: str
    is_finished: bool = False


@dataclass
class RequestState:
masahi (Member, Author):

Now this class basically acts like SequenceGroup in vLLM.
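
A rough sketch of the shape being described; field names that do not appear in this PR's excerpts (request_id, generation_sequences) are assumptions:

from dataclasses import dataclass, field
from typing import List

@dataclass
class GenerationSequence:
    # One generated sequence of a request (illustrative sketch).
    generated_token_ids: List[int] = field(default_factory=list)
    next_start_position: int = 0
    output_text: str = ""
    is_finished: bool = False

@dataclass
class RequestState:
    # Analogous to vLLM's SequenceGroup: one request that owns several
    # parallel-sampled sequences.
    request_id: str
    prompt_token_ids: List[int]
    generation_sequences: List[GenerationSequence] = field(default_factory=list)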

masahi force-pushed the parallel-sampling branch from e0770c3 to bdb0be3 on December 8, 2023 00:07
masahi marked this pull request as draft on December 8, 2023 00:14
masahi marked this pull request as ready for review on December 8, 2023 05:44
prefix_idx = generation_sequence.next_start_position

# TODO(masahi): Figure out a way to remove this concat
token_ids = prompt_tokens + generation_sequence.generated_token_ids
masahi (Member, Author):

This list concat is probably causing perf regression. Improving this is left for future work.

elvin-n:

Only a few tokens should be passed into decode_last_output; copying them into a new container and concatenating the new token is unlikely to cause a perf regression.

sunggg (Member):

PR #102 changes this part and actually adds more concats. I also agree with @elvin-n, but it would be good to collect data and confirm.
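
One quick way to collect such data outside the engine is a micro-benchmark along these lines (a standalone sketch; the token counts are made up):

import timeit

prompt_tokens = list(range(2048))          # pretend prompt
generated_token_ids = list(range(256))     # pretend generated tokens so far

# Time the per-step list concat that the TODO above refers to.
per_call_s = timeit.timeit(
    "prompt_tokens + generated_token_ids",
    globals={"prompt_tokens": prompt_tokens,
             "generated_token_ids": generated_token_ids},
    number=10_000,
) / 10_000
print(f"list concat: {per_call_s * 1e6:.2f} us per call")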

if not gen_seq.is_finished:
    # TODO(masahi): No need to add prompt_token_ids here if we send
    # the prompt len instead
    token_ids = state.prompt_token_ids + gen_seq.generated_token_ids
masahi (Member, Author):

This list concat is probably causing perf regression. Improving this is left for future work.
We need to update paged_cache_model.py if we remove prompt tokens here.
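
To illustrate the direction of the TODO, a hedged sketch of passing the prompt length instead of the full prompt token list; the DecodeInput name and the corresponding change in paged_cache_model.py are assumptions, not part of this PR:

from dataclasses import dataclass
from typing import List

@dataclass
class DecodeInput:
    # Ship only the prompt length plus the generated tokens, instead of
    # rebuilding prompt_token_ids + generated_token_ids on every step.
    prompt_len: int
    generated_token_ids: List[int]

    @property
    def total_len(self) -> int:
        return self.prompt_len + len(self.generated_token_ids)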

sunggg (Member) left a comment:

Thank you @masahi for the refactoring!
Would you try running serve/tests/unittest/test_engine_with_samplers.py? I'm getting errors with the Prometheus metric and the finish reason for the stopping condition.

"Preempt request to free %s tokens",
len(request_to_remove.token_ids),
)
self.evict_request()

if self.cache_manager.get_max_new_tokens() <= self.max_decode_steps:
sunggg (Member):

Is this why we cannot move _adjust_batch to engine_common.py?

masahi (Member, Author):

No, in the sync engine, there are calls to self._discard_cancelled_requests_from_queue() that are not in the staging engine.

https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/sync_engine.py#L263
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/sync_engine.py#L316

sunggg (Member):

Interesting. Is there anything sync-engine-specific about this method? It seems the cancellation logic is a bit different. If not, maybe a follow-up PR can unify this further.
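
Purely as an illustration of what such a follow-up unification could look like (all names below are hypothetical, not code from this PR):

class EngineCommonMixin:
    # Shared batching logic with a hook that the sync engine overrides and the
    # staging engine leaves as a no-op.
    def adjust_batch(self) -> None:
        self._discard_cancelled_requests_from_queue()
        # ... shared eviction / batch-growing logic would follow here ...

    def _discard_cancelled_requests_from_queue(self) -> None:
        pass  # default: nothing to do (staging-engine behaviour)

class SyncEngineSketch(EngineCommonMixin):
    def __init__(self) -> None:
        self.queue: list = []
        self.cancelled_ids: set = set()

    def _discard_cancelled_requests_from_queue(self) -> None:
        # Sync-engine-specific: drop cancelled requests inline.
        self.queue = [r for r in self.queue if r not in self.cancelled_ids]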

(resolved review thread on serve/mlc_serve/engine/engine_common.py)
return True


class EngineBase:
sunggg (Member):

Would it make sense to move this base class to engine/base.py?
Also, I'm wondering if further unification with InferenceEngine and ScopedInferenceEngine is possible.

https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/base.py#L180

masahi (Member, Author):

engine/base.py is mostly declarations of core data structures while engine_common.py is specifically for sharing common code between the two engines. So EngineBase belongs to the latter.

InferenceEngine and ScopedInferenceEngine are interface classes. I don't understand what you mean by "unifying" them with EngineBase.

sunggg (Member):

Ah, sorry. What I meant is that maybe we can unify InferenceEngine and ScopedInferenceEngine. I guess it is possible by letting the sync engine do nothing for start and stop. My tiny complaint is that we have many similar-looking classes with small differences, which might confuse newcomers.
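
To make the idea concrete, a hypothetical sketch (not code from this repo) of a single interface whose start/stop default to no-ops, so a sync engine inherits them unchanged while a staging engine overrides them:

from abc import ABC, abstractmethod

class UnifiedInferenceEngine(ABC):
    def start(self) -> None:
        pass  # no-op by default; a sync engine needs no background worker

    def stop(self) -> None:
        pass

    @abstractmethod
    def step(self) -> None:
        ...

class SyncEngineExample(UnifiedInferenceEngine):
    def step(self) -> None:
        ...  # run one synchronous generation step

class StagingEngineExample(UnifiedInferenceEngine):
    def start(self) -> None:
        ...  # spawn the generation loop worker here

    def stop(self) -> None:
        ...  # join/terminate the worker here

    def step(self) -> None:
        ...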

masahi (Member, Author) commented Dec 11, 2023

> Would you try running serve/tests/unittest/test_engine_with_samplers.py? I'm getting errors with Prometheus metric and finish reason for stopping condition.

They are fixed now.

elvin-n commented Dec 11, 2023

python -m pytest serve/tests

================================================================================== short test summary info ===================================================================================
FAILED serve/tests/unittest/test_staging_engine.py::test_single_request - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_staging_engine.py::test_single_request_step_to_finish - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_staging_engine.py::test_multiple_requests_wait_queue - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_staging_engine.py::test_multiple_requests_preempt - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_staging_engine.py::test_cache_evict_hang_staging - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_staging_engine.py::test_big_prompt_fit_to_cache_staging - RuntimeError: Error when calling GenerationLoopWorker: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_sync_engine.py::test_single_request_step_to_finish - AttributeError: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_sync_engine.py::test_multiple_requests_wait_queue - AttributeError: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_sync_engine.py::test_multiple_requests_preempt - AttributeError: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_sync_engine.py::test_cache_evict_hang - AttributeError: 'DummyCacheManager' object has no attribute 'free_request'
FAILED serve/tests/unittest/test_sync_engine.py::test_big_prompt_fit_to_cache - AttributeError: 'DummyCacheManager' object has no attribute 'free_request'
========================================================================= 11 failed, 6 passed, 3 warnings in 10.30s ==========================================================================
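
The failures say the DummyCacheManager test double lacks the new free_request hook. A minimal sketch of the kind of stub that would satisfy it (the exact signature is an assumption, not taken from the repo):

class DummyCacheManager:
    # Test-double sketch; only the missing hook is shown.
    def free_request(self, state) -> None:
        # A real cache manager would release the KV-cache blocks held by every
        # sequence of the request; the dummy has nothing to free.
        pass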

masahi (Member, Author) commented Dec 11, 2023

@elvin-n Fixed.

(resolved review thread on serve/mlc_serve/model/paged_cache_model.py)
(resolved review thread on serve/mlc_serve/engine/staging_engine.py)
sunggg (Member) left a comment:

Thank you @masahi for the refactoring and @elvin-n for the review!

sunggg merged commit 745ce71 into octoml:batch-serving on Dec 11, 2023 (1 check passed).