[torch.compile] store inductor compiled Python file #12182
Conversation
@dataclasses.dataclass
class InductorArtifact:
    hash_str: str = ""
    file_path: str = ""
Inside the file at `file_path`, we can find a `call` function. That is the ultimate function that inductor compiled and that gets called.
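As a rough illustration (this helper is not part of the PR; it only assumes an `InductorArtifact` populated as in the diff above), one can open the stored file and locate the generated `call` function:

```python
from pathlib import Path

def show_call_function(artifact: "InductorArtifact") -> None:
    """Print where the inductor-generated `call` function lives inside the stored file."""
    source = Path(artifact.file_path).read_text()
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("def call("):
            print(f"{artifact.file_path}:{lineno}: {line.strip()}")
            return
    print(f"no call() found in {artifact.file_path}")
```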
"vllm.unified_attention_with_output", | ||
] | ||
# v0 uses full graph compilation | ||
self.splitting_ops = [] |
Now that the kv cache is completely hidden from torch.compile, changing gpu memory utilization (and thus changing the kv cache shape) will not affect torch.compile. Thus they can share the same compilation cache.
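A hedged usage sketch of that consequence (the model name is just an example, the two engines would be created in separate runs, and this assumes torch.compile is enabled for the model): two configurations that differ only in `gpu_memory_utilization` should now trace the same graph and therefore reuse the same compilation cache entry.

```python
from vllm import LLM

# Run 1: compiles the model and populates the torch.compile / inductor cache.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)

# Run 2 (in a fresh process): only the kv-cache size changes, which
# torch.compile no longer sees, so the cached compilation can be reused.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
```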
Overall LGTM. Left one small comment.
@@ -199,6 +219,7 @@ def wrap_inductor(graph: fx.GraphModule,
                "Inductor cache lookup failed. Please remove"
                f"the cache file {cache_data.cache_file_path} and try again."  # noqa
            )
        inductor_artifact.file_path = inductor_compiled_graph.current_callable.__code__.co_filename  # noqa
Why do we only update the `file_path` of `graph_index` 0? Do we only support fullgraph mode of torch.compile?
That seems to be a misunderstanding. The `if graph_index == 0:` check is only for logging; this line is executed for every piece of the graph.
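For illustration, a minimal standalone sketch of the control flow being described (not the vLLM source; `wrap_inductor_piece` and its arguments are placeholder names): the log message is gated on `graph_index == 0`, while the `file_path` assignment below it runs unconditionally for every graph piece.

```python
import logging

logger = logging.getLogger(__name__)

def wrap_inductor_piece(graph_index: int, inductor_artifact, compiled_callable) -> None:
    # Only the first piece emits the log message ...
    if graph_index == 0:
        logger.info("Compiling graph pieces with inductor")
    # ... but the file path is recorded once per compiled piece.
    inductor_artifact.file_path = compiled_callable.__code__.co_filename
```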
Test failure is unrelated. Merging.
#11108 enabled the compilation cache, but it only stores a hash string, which is not human-readable.

This PR further stores the path of the Python file compiled by inductor, so that we can manually check the result.

As a result, I can confirm that torch.compile does not do anything w.r.t. the kv cache, thanks to #11677. Therefore, we don't need to split the graph in v0. We can directly compile and load the full graph, which leads to faster compilation.
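As a side note, a minimal standalone sketch of the mechanism the new `file_path` field relies on (nothing vLLM-specific here): every Python function object records the file it was defined in via `__code__.co_filename`, and the callable produced by inductor is no exception, which is what lets the PR point at the generated file on disk.

```python
def example(x):
    return x + 1

# For inductor's generated module, the same attribute points at the
# generated Python file, which is the path the PR stores.
print(example.__code__.co_filename)
```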