Fix the remote gpu addr translation without nvshmem #286

Merged
merged 3 commits into from
Aug 20, 2025

CalebZ9909
Collaborator

No description provided.

@CalebZ9909
Collaborator Author

@MaoZiming Hi Ziming, you can review here. We can keep this PR to track my work on deploying the cpu_proxy internode_low_latency kernel.

@CalebZ9909
Collaborator Author

@MaoZiming @YangZhou1997 Hi all, based on the results of running the simple internode test, the UCCL_DeepEP dispatch and combine now work. I checked the remote addresses and offsets, and they look correct to me. One flaw is that the test does not exit cleanly; it seems that the ranks are not fully synchronized at the end. I will fix this soon.

@MaoZiming
Member

@CalebZ9909 Thanks! Great work! I just got back, will review in the afternoon!

@MaoZiming MaoZiming changed the base branch from main to gpu-driven-enable-dual-proxy-deepep August 18, 2025 17:15
// "dst_rank: %d, sm_id: %d, lane_id: %d, message_idx: %d, num_ring_addrs: "
// "%d, cur_head: %llu, cur_tail: %llu, inflight: %llu\n",
// rptr_val, lptr_val, bytes_val, dst_rank, sm_id, lane_id, message_idx,
// num_ring_addrs, cur_head, cur_tail, inflight);
Member

We will remove these printf calls once the code works.

wrs[i].wr.rdma.remote_addr = S.remote_addr + i * bytes;
// wrs[i].wr.rdma.remote_addr = S.remote_addr + i * bytes;

wrs[i].wr.rdma.remote_addr = S.remote_addr + cmd.req_rptr;
Member

@CalebZ9909 Can you double-check whether this offset (cmd.req_rptr) is relative to dispatch_rdma_recv_data_buffer in LowLatencyBuffer?
If it is, it might not be the correct offset, since S.remote_addr stores the start of the LowLatencyBuffer buffers[2] region rather than the start of that sub-buffer.
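
To illustrate the concern, here is a minimal sketch of translating a local pointer into a peer's address space by taking the offset from the local region base, not from a sub-buffer inside it. It assumes both sides use the same buffer layout. `PeerRegion`, its `size` field, and `translate_to_remote` are hypothetical names for illustration; only `remote_addr` appears in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-peer state; not the PR's actual struct.
struct PeerRegion {
  uint64_t remote_addr;  // start of the peer's registered RDMA region
  size_t size;           // registered length in bytes (assumed field)
};

// Translate a local pointer into the peer's address space.
// Assumption: local and remote regions share an identical layout, so the
// byte offset from the region base is the same on both sides.
inline uint64_t translate_to_remote(const void* local_ptr,
                                    const void* local_base,
                                    const PeerRegion& peer) {
  auto p = reinterpret_cast<uintptr_t>(local_ptr);
  auto b = reinterpret_cast<uintptr_t>(local_base);
  assert(p >= b);  // pointer must lie inside the local region
  uint64_t offset = static_cast<uint64_t>(p - b);
  assert(offset < peer.size);  // and inside the registered length
  return peer.remote_addr + offset;
}
```

If the offset carried in the command were instead measured from a sub-buffer, the sub-buffer's own offset within the region would have to be added before it is applied to the remote base.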

Member

Similarly, for nvshmemi_ibgda_amo_nonfetch_add, we need to check the offset of dispatch_rdma_recv_count_buffer relative to rdma_buffer in LowLatencyLayout.
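
As a rough sketch of the arithmetic in question, the counter's remote address would be the region base, plus the counter sub-buffer's offset within the region, plus the slot offset. The layout below is a deliberately simplified, hypothetical two-part layout (`SketchLayout`, `make_sketch_layout`, and `remote_counter_addr` are illustrative names, and 32-bit counters are an assumption); the real LowLatencyLayout has more regions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified layout: [ recv data | recv counters ].
// It only demonstrates the offset arithmetic, not the real layout.
struct SketchLayout {
  size_t recv_data_offset;   // offset of the data sub-buffer from the base
  size_t recv_count_offset;  // offset of the counter sub-buffer from the base
  size_t total_bytes;        // total size of the region
};

inline SketchLayout make_sketch_layout(size_t recv_data_bytes,
                                       size_t num_counters) {
  SketchLayout l{};
  l.recv_data_offset = 0;
  l.recv_count_offset = recv_data_bytes;  // counters follow the data
  l.total_bytes = recv_data_bytes + num_counters * sizeof(int32_t);
  return l;
}

// Remote address of counter `idx`, measured from the peer's region base.
inline uint64_t remote_counter_addr(uint64_t remote_base,
                                    const SketchLayout& l, size_t idx) {
  uint64_t addr = remote_base + l.recv_count_offset + idx * sizeof(int32_t);
  // The whole counter must fit inside the region.
  assert(addr + sizeof(int32_t) <= remote_base + l.total_bytes);
  return addr;
}
```

The point of the sketch is that dropping the `recv_count_offset` term would silently aim the atomic add at the data sub-buffer instead of the counters.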

Collaborator Author

Sure, I will modify this.

@CalebZ9909 CalebZ9909 merged commit d1098b9 into gpu-driven-enable-dual-proxy-deepep Aug 20, 2025
1 of 2 checks passed
@CalebZ9909 CalebZ9909 deleted the pr-278 branch August 20, 2025 09:40
@CalebZ9909 CalebZ9909 restored the pr-278 branch August 20, 2025 09:54
2 participants