
[KERNELS] Improve block sizes for batched matmul_ogs with small m/n/k. #7897


Open · wants to merge 4 commits into main

Conversation

@yongjik (Collaborator) commented Aug 19, 2025

(Previously, the chosen block sizes could be much bigger than m/n/k, so batched matmuls on very small matrices spent most of each block on padding.)

Example perf difference (before -> after):

H100:
    B=500000 M=8 N=8 K=8
        >> torch.float16     0.850 ms -> 0.388 ms
        >> torch.bfloat16    0.828 ms -> 0.354 ms
        >> torch.float8_e5m2 0.829 ms -> 0.373 ms
    B=500000 M=16 N=16 K=16
        >> torch.float16     0.791 ms -> 0.381 ms
        >> torch.bfloat16    0.790 ms -> 0.382 ms
        >> torch.float8_e5m2 0.779 ms -> 0.366 ms

GB200:
    B=500000 M=8 N=8 K=8
        >> torch.float16     0.676 ms -> 0.314 ms
        >> torch.bfloat16    0.652 ms -> 0.297 ms
        >> torch.float8_e5m2 0.659 ms -> 0.294 ms
    B=500000 M=16 N=16 K=16
        >> torch.float16     0.622 ms -> 0.305 ms
        >> torch.bfloat16    0.606 ms -> 0.306 ms
        >> torch.float8_e5m2 0.616 ms -> 0.296 ms
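The idea of capping block sizes by the problem dimensions can be illustrated with a minimal sketch. This is not the PR's actual heuristic; `clamp_block_sizes` and its `min_block` floor are hypothetical names chosen here for illustration, and only `triton.next_power_of_2` is a real library call.

```python
# Minimal sketch (illustration only, not the PR's heuristic): keep block sizes
# power-of-two but never much larger than the problem dimensions, so a batched
# 8x8x8 matmul does not launch e.g. 128x128 tiles that are mostly padding.
import triton


def clamp_block_sizes(block_m, block_n, block_k, m, n, k, min_block=16):
    # Hypothetical helper: cap each block dimension at the next power of two
    # of the corresponding problem size, with a small lower bound.
    block_m = min(block_m, max(min_block, triton.next_power_of_2(m)))
    block_n = min(block_n, max(min_block, triton.next_power_of_2(n)))
    block_k = min(block_k, max(min_block, triton.next_power_of_2(k)))
    return block_m, block_n, block_k


print(clamp_block_sizes(128, 128, 64, m=8, n=8, k=8))  # -> (16, 16, 16)
```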

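For context on how numbers like the ones above can be measured, here is a generic CUDA-event timing sketch for the same B=500000, M=N=K=8 shape. It uses `torch.bmm` purely as a stand-in for the batched matmul path; it is not the repository's benchmark harness and does not call `matmul_ogs`.

```python
# Generic timing sketch, assuming a CUDA device is available.
import torch

B, M, N, K = 500_000, 8, 8, 8
a = torch.randn(B, M, K, device="cuda", dtype=torch.float16)
b = torch.randn(B, K, N, device="cuda", dtype=torch.float16)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for _ in range(3):      # warm-up iterations
    torch.bmm(a, b)
torch.cuda.synchronize()

start.record()
torch.bmm(a, b)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end):.3f} ms")
```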
@yongjik requested a review from ptillet as a code owner on August 19, 2025 00:31