[Kernel] Fix Flashinfer Correctness #7284

Merged · 1 commit · Aug 7, 2024
10 changes: 7 additions & 3 deletions vllm/attention/backends/flashinfer.py
@@ -127,6 +127,7 @@ def __post_init__(self):
             raise ValueError(
                 f"Only {supported_head_sizes} are supported for head_dim,",
                 f"received {self.head_dim}.")
+        self.is_profile_run = is_block_tables_empty(self.block_tables)
Contributor:

@LiuXiaoxuanPKU I think this line is buggy: self.block_tables here is always a tensor, so is_block_tables_empty always returns False. From what I've checked, this happens to work with flashinfer v0.1.2 but not v0.1.3, which raises RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257 during the profile run. (This is because the if self.is_profile_run block is never entered.)

I tried a quick fix of patching is_block_tables_empty to also treat a block table with numel() == 0 as empty, which lets the profile run pass. But that introduces logic issues for prefill, which obviously has a zero-length block table as well. Not sure what the best fix is; I just wanted to raise the concern here.
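For reference, a minimal sketch of the type mismatch described above. It assumes is_block_tables_empty follows the None/dict check used elsewhere in the vLLM attention backends (the exact location and signature may differ):

```python
import torch

def is_block_tables_empty(block_tables) -> bool:
    # Sketch of the assumed helper: "empty" means None or a dict
    # whose values are all None. A torch.Tensor never matches
    # either case, so this always returns False for a tensor.
    if block_tables is None:
        return True
    if isinstance(block_tables, dict) and all(
            v is None for v in block_tables.values()):
        return True
    return False

# The profile run passes a zero-filled tensor, not None or a dict:
print(is_block_tables_empty(None))                 # True
print(is_block_tables_empty(torch.zeros(256, 8)))  # False
print(is_block_tables_empty(torch.empty(0)))       # False, even with numel() == 0
```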


     def begin_forward(self):
         if self.num_prefill_tokens > 0:

@@ -140,11 +141,14 @@ def begin_forward(self):
             assert self.paged_kv_last_page_len is not None
             batch_size = self.query_start_loc.shape[0] - 1
             assert batch_size >= 0
-            # The prefill stage does not read kv cache.
+            # The profile run does not read kv cache.
             # Both paged_kv_indices and paged_kv_last_page_len are empty.
             # paged_kv_indptr is a zero tensor with size batch_size + 1.
-            self.paged_kv_indptr = torch.zeros(batch_size + 1,
-                                               device=self.device)
+            if self.is_profile_run:
+                self.paged_kv_indptr = torch.zeros(batch_size + 1,
+                                                   device=self.device)
+            else:
+                self.paged_kv_indptr = self.paged_kv_indptr.to(self.device)
             self.paged_kv_last_page_len = self.paged_kv_last_page_len.to(
                 self.device)
             self.paged_kv_indices = self.paged_kv_indices.to(self.device)
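For context (my own illustration, not part of the PR): FlashInfer treats paged_kv_indptr as a CSR-style offset array of size batch_size + 1 into paged_kv_indices, which is the invariant the CHECK_EQ in the reported error enforces. A rough sketch of both cases, with made-up page ids:

```python
import torch

batch_size = 2

# Normal decode batch: sequence 0 owns 3 KV-cache pages, sequence 1
# owns 2, so the offsets are [0, 3, 5] and indices lists the page ids.
paged_kv_indptr = torch.tensor([0, 3, 5], dtype=torch.int32)
paged_kv_indices = torch.tensor([7, 8, 9, 4, 5], dtype=torch.int32)
assert paged_kv_indptr.size(0) == batch_size + 1

# Profile run: no pages are allocated yet, but the size invariant
# still has to hold, hence torch.zeros(batch_size + 1) in the patch.
profile_indptr = torch.zeros(batch_size + 1, dtype=torch.int32)
profile_indices = torch.empty(0, dtype=torch.int32)
assert profile_indptr.size(0) == batch_size + 1
```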