Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B #4064
Conversation
@@ -147,7 +147,7 @@ def forward_cuda(
         key: torch.Tensor,
         offsets: Optional[torch.Tensor] = None,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
-        if _is_cuda_available:
+        if _is_cuda_available and (self.head_size != 80):
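For context, the effect of the patched condition can be restated as a small predicate. This is only a sketch: use_fused_rope_kernel is a hypothetical name, and in the actual code the check sits inline in forward_cuda and chooses between the fused CUDA/FlashInfer kernel and the native fallback.

```python
def use_fused_rope_kernel(cuda_available: bool, head_size: int) -> bool:
    # Hypothetical restatement of the patched guard: head_size 80
    # (EXAONE-3.5-2.4B) skips the fused kernel and takes the native
    # fallback; every other head size behaves as before.
    return cuda_available and head_size != 80


# Example: use_fused_rope_kernel(True, 80)  -> False (native path)
#          use_fused_rope_kernel(True, 128) -> True  (fused kernel)
```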
@lkm2835 Can you check whether /sgl-kernel/3rdparty/flashinfer/include/flashinfer/pos_enc.cuh:584
has a specific divisibility requirement on the RotaryEmbedding shape? Some kernels require the shape/size to be wholly divisible by a fixed amount. I want to see if that would be a better fix than hard-coding the value 80, which may not cover all failure cases.
@Qubitium Thanks for replying.
I checked the FlashInfer code:
https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/utils.cuh#L170-L197
FlashInfer only supports head dims of 64, 128, 256, and 512.
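Given that dispatch table, the guard could be written against the full set of supported head dims instead of special-casing 80. A minimal sketch of the idea, in which the constant and helper names are illustrative rather than the exact patch:

```python
# Head dims accepted by FlashInfer's dispatch (per the utils.cuh link above).
FLASHINFER_SUPPORTED_HEAD_DIMS = (64, 128, 256, 512)


def use_fused_rope_kernel(cuda_available: bool, head_size: int) -> bool:
    # Only take the fused kernel for head dims FlashInfer actually supports,
    # so any other size (e.g. 80) falls back to the native implementation.
    return cuda_available and head_size in FLASHINFER_SUPPORTED_HEAD_DIMS
```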
@lkm2835 Looks good! All failure cases are covered, and this should help other models avoid the same pitfall too. Let's wait for a response from the other reviewers.
Hi @merrymercy.
@zhaochenyang20 Can you check this?
cc @zhyncs
Hi @Ying1123, can you check this PR?
@lkm2835 I have pinged the SGLang Slack channel to accelerate this merge.
Motivation
https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
EXAONE-3.5-2.4B-Instruct uses head_dim 80.
--attention-backend flashinfer does not support head dim 80, so I used --attention-backend triton to avoid FlashInfer. However, FlashInfer is still used for the rotary embedding, so the unsupported head_dim: 80 error still occurs even with --attention-backend triton.
Error log
Modifications
Checklist