
Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B #4064


Merged

zhyncs merged 2 commits into sgl-project:main from lkm2835:fix-exaone on Mar 24, 2025

Conversation

@lkm2835 (Contributor) commented Mar 4, 2025

Motivation

https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
EXAONE-3.5-2.4B-Instruct uses head_dim 80.

--attention-backend flashinfer does not support head_dim 80, so I use --attention-backend triton to avoid FlashInfer:

python3 -m sglang.launch_server --model-path /model/EXAONE-3.5-2.4B-Instruct --context-length 32768 --host 0.0.0.0 --port 9000 --tp 1 --disable-radix-cache --trust-remote-code --attention-backend triton

However, the RotaryEmbedding still dispatches to the FlashInfer-backed kernel, so the unsupported head_dim: 80 error still occurs even with --attention-backend triton.
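
For context, a minimal sketch of the dispatch in sglang/srt/layers/rotary_embedding.py before this fix (argument names are paraphrased from the traceback above, not verbatim from the repo). It shows why the attention backend flag does not matter here: the FlashInfer-backed op is selected on CUDA availability alone.

    # Sketch only; exact argument names are assumptions based on the traceback.
    def forward_cuda(self, positions, query, key, offsets=None):
        if _is_cuda_available:
            # Chosen on CUDA availability alone, independent of
            # --attention-backend, so head_dim 80 still reaches the
            # FlashInfer kernel that rejects it.
            apply_rope_with_cos_sin_cache_inplace(
                positions=positions,
                query=query,
                key=key,
                head_size=self.head_size,
                cos_sin_cache=self.cos_sin_cache,
                is_neox=self.is_neox_style,
            )
            return query, key
        return self.forward_native(positions, query, key, offsets)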

Error log

  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 245, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 312, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 391, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 384, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 312, in forward
    hidden_states = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 281, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 230, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/exaone.py", line 167, in forward
    q, k = self.rotary_emb(positions, q, k)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 14, in forward
    return self._forward_method(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/rotary_embedding.py", line 152, in forward_cuda
    apply_rope_with_cos_sin_cache_inplace(
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/ops/__init__.py", line 55, in apply_rope_with_cos_sin_cache_inplace
    torch.ops.sgl_kernels.apply_rope_pos_ids_cos_sin_cache(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Error in function 'BatchQKApplyRotaryPosIdsCosSinCache' at /sgl-kernel/3rdparty/flashinfer/include/flashinfer/pos_enc.cuh:584: Unsupported head_dim: 80

Modifications

Skip the FlashInfer-backed rope kernel and fall back to the native RotaryEmbedding implementation when the model's head_size is not supported (e.g., 80).

@@ -147,7 +147,7 @@ def forward_cuda(
         key: torch.Tensor,
         offsets: Optional[torch.Tensor] = None,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
-        if _is_cuda_available:
+        if _is_cuda_available and (self.head_size != 80):
@Qubitium (Contributor) commented on the diff:
@lkm2835 Can you check whether /sgl-kernel/3rdparty/flashinfer/include/flashinfer/pos_enc.cuh:584 has a specific divisibility requirement on the RotaryEmbedding shape? Some kernels require the shape/size to be wholly divisible by a fixed amount. I want to see if that would be better than this patch's fixed value of 80, which may not cover all failure cases.

@lkm2835 (Contributor, Author) replied:
@Qubitium Thanks for replying.
I checked the FlashInfer code:
https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/utils.cuh#L170-L197
FlashInfer only supports head_dim 64, 128, 256, and 512.
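
Given that list, a hedged sketch of a generalized guard (the constant name is illustrative and not necessarily the merged code; it mirrors the head sizes in FlashInfer's utils.cuh dispatch macro):

    # Illustrative sketch; _FLASHINFER_ROPE_HEAD_SIZES is an assumed name.
    _FLASHINFER_ROPE_HEAD_SIZES = (64, 128, 256, 512)

    if _is_cuda_available and self.head_size in _FLASHINFER_ROPE_HEAD_SIZES:
        # Fast CUDA path: head_size is one the FlashInfer kernel dispatches on.
        apply_rope_with_cos_sin_cache_inplace(
            positions=positions,
            query=query,
            key=key,
            head_size=self.head_size,
            cos_sin_cache=self.cos_sin_cache,
            is_neox=self.is_neox_style,
        )
        return query, key
    # Any other head_size (such as EXAONE's 80) takes the native path.
    return self.forward_native(positions, query, key, offsets)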

@lkm2835 lkm2835 changed the title Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B (head_dim 80) Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B Mar 4, 2025
@Qubitium (Contributor) left a comment:
@lkm2835 Looks good! All failure cases are covered, and this should help other models avoid the same pitfall too. Let's wait for responses from the other reviewers.

@lkm2835 (Contributor, Author) commented Mar 11, 2025

Hi @merrymercy,
Can you review this PR for EXAONE (and other models) support on the latest sglang?

@Qubitium (Contributor):
@zhaochenyang20 Can you check this?

@zhaochenyang20 (Contributor):
cc @zhyncs

@lkm2835 (Contributor, Author) commented Mar 24, 2025

Hi @Ying1123, can you check this PR?
We want to use EXAONE-Deep-2.4B with sglang.

@Qubitium (Contributor):
@lkm2835 I have pinged the SGLang Slack channel to accelerate this merge.

@zhyncs zhyncs merged commit 2a206b2 into sgl-project:main Mar 24, 2025
18 of 20 checks passed
@lkm2835 lkm2835 deleted the fix-exaone branch April 25, 2025 04:18