
Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B #4064


Merged

zhyncs merged 2 commits into sgl-project:main from lkm2835:fix-exaone on Mar 24, 2025

Conversation

@lkm2835 (Contributor) commented Mar 4, 2025

Motivation

https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
EXAONE-3.5-2.4B-Instruct uses head_dim 80.

--attention-backend flashinfer does not support head_dim 80, so I use --attention-backend triton to avoid FlashInfer:

python3 -m sglang.launch_server --model-path /model/EXAONE-3.5-2.4B-Instruct --context-length 32768 --host 0.0.0.0 --port 9000 --tp 1 --disable-radix-cache --trust-remote-code --attention-backend triton

However, the RotaryEmbedding still dispatches to the FlashInfer-backed kernel, so the unsupported head_dim: 80 error still occurs even with --attention-backend triton.
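
For context, a minimal sketch of the dispatch in sglang/srt/layers/rotary_embedding.py before this fix (argument names are paraphrased from the traceback above, not verbatim from the repo). It shows why the attention backend flag does not matter here: the FlashInfer-backed op is selected on CUDA availability alone.

    # Sketch only; exact argument names are assumptions based on the traceback.
    def forward_cuda(self, positions, query, key, offsets=None):
        if _is_cuda_available:
            # Chosen on CUDA availability alone, independent of
            # --attention-backend, so head_dim 80 still reaches the
            # FlashInfer kernel that rejects it.
            apply_rope_with_cos_sin_cache_inplace(
                positions=positions,
                query=query,
                key=key,
                head_size=self.head_size,
                cos_sin_cache=self.cos_sin_cache,
                is_neox=self.is_neox_style,
            )
            return query, key
        return self.forward_native(positions, query, key, offsets)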

Error log

  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 245, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 312, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 391, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 384, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 312, in forward
    hidden_states = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 281, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 230, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 167, in forward
    q, k = self.rotary_emb(positions, q, k)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 14, in forward
    return self._forward_method(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/rotary_embedding.py", line 152, in forward_cuda
    apply_rope_with_cos_sin_cache_inplace(
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/ops/__init__.py", line 55, in apply_rope_with_cos_sin_cache_inplace
    torch.ops.sgl_kernels.apply_rope_pos_ids_cos_sin_cache(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Error in function 'BatchQKApplyRotaryPosIdsCosSinCache' at /sgl-kernel/3rdparty/flashinfer/include/flashinfer/pos_enc.cuh:584: Unsupported head_dim: 80

Modifications

Skip the FlashInfer-backed rope kernel and fall back to the native RotaryEmbedding implementation when the model's head_size is not supported (e.g., 80).

@@ -147,7 +147,7 @@ def forward_cuda(
         key: torch.Tensor,
         offsets: Optional[torch.Tensor] = None,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
-        if _is_cuda_available:
+        if _is_cuda_available and (self.head_size != 80):
@Qubitium (Contributor) commented on the diff:
@lkm2835 Can you check whether /sgl-kernel/3rdparty/flashinfer/include/flashinfer/pos_enc.cuh:584 has a specific divisibility requirement on the RotaryEmbedding shape? Some kernels require the shape/size to be wholly divisible by a fixed amount. I want to see if that would be better than this patch's fixed value of 80, which may not cover all failure cases.

@lkm2835 (Contributor, Author) replied:
@Qubitium Thanks for replying.
I checked the FlashInfer code:
https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/utils.cuh#L170-L197
FlashInfer only supports head_dim 64, 128, 256, and 512.
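
Given that list, a hedged sketch of a generalized guard (the constant name is illustrative and not necessarily the merged code; it mirrors the head sizes in FlashInfer's utils.cuh dispatch macro):

    # Illustrative sketch; _FLASHINFER_ROPE_HEAD_SIZES is an assumed name.
    _FLASHINFER_ROPE_HEAD_SIZES = (64, 128, 256, 512)

    if _is_cuda_available and self.head_size in _FLASHINFER_ROPE_HEAD_SIZES:
        # Fast CUDA path: head_size is one the FlashInfer kernel dispatches on.
        apply_rope_with_cos_sin_cache_inplace(
            positions=positions,
            query=query,
            key=key,
            head_size=self.head_size,
            cos_sin_cache=self.cos_sin_cache,
            is_neox=self.is_neox_style,
        )
        return query, key
    # Any other head_size (such as EXAONE's 80) takes the native path.
    return self.forward_native(positions, query, key, offsets)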

@lkm2835 lkm2835 changed the title Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B (head_dim 80) Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B Mar 4, 2025
@Qubitium (Contributor) left a comment:
@lkm2835 Looks good! All failure cases are covered, and this should help other models avoid the same pitfall too. Let's wait for responses from the other reviewers.

@lkm2835 (Contributor, Author) commented Mar 11, 2025

Hi @merrymercy,
Can you review this PR for EXAONE (and other models) support on the latest sglang?

@Qubitium (Contributor):
@zhaochenyang20 Can you check this?

@zhaochenyang20 (Contributor):
cc @zhyncs

@lkm2835 (Contributor, Author) commented Mar 24, 2025

Hi @Ying1123, can you check this PR?
We want to use EXAONE-Deep-2.4B with sglang.

@Qubitium (Contributor):
@lkm2835 I have pinged the SGLang Slack channel to accelerate this merge.

@zhyncs zhyncs merged commit 2a206b2 into sgl-project:main Mar 24, 2025
18 of 20 checks passed
@lkm2835 lkm2835 deleted the fix-exaone branch April 25, 2025 04:18