-
Notifications
You must be signed in to change notification settings - Fork 33
Fix the remote gpu addr translation without nvshmem #286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@MaoZiming Hi Ziming, you can review here, we can keep this pr for tracking deploying the cpu_proxy internode_low_latency kernel on my side |
@MaoZiming @YangZhou1997 Hi All, Based on the results of running simple internode test, the UCCL_DeepEP dispatch and combine work now. I checked the remote address and offset, and they look good to me. One flaw is that the test does not quit normally; it seems that all ranks are not well synced at the end. I will fix this soon. |
@CalebZ9909 Thanks! Great work! I just got back, will review in the afternoon! |
// "dst_rank: %d, sm_id: %d, lane_id: %d, message_idx: %d, num_ring_addrs: " | ||
// "%d, cur_head: %llu, cur_tail: %llu, inflight: %llu\n", | ||
// rptr_val, lptr_val, bytes_val, dst_rank, sm_id, lane_id, message_idx, | ||
// num_ring_addrs, cur_head, cur_tail, inflight); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will remove these printf after the code works.
wrs[i].wr.rdma.remote_addr = S.remote_addr + i * bytes; | ||
// wrs[i].wr.rdma.remote_addr = S.remote_addr + i * bytes; | ||
|
||
wrs[i].wr.rdma.remote_addr = S.remote_addr + cmd.req_rptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CalebZ9909 Can you double check whether this offset (cmd.req_rptr
) is with respect to dispatch_rdma_recv_data_buffer
in LowLatencyBuffer
?
This might not be the correct offset since S.remote_addr
stores the start of the LowLatencyBuffer buffers[2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly for nvshmemi_ibgda_amo_nonfetch_add
, need to check the offset of dispatch_rdma_recv_count_buffer
with respect to rdma_buffer
in LowLatencyLayout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will modify this
d1098b9
into
gpu-driven-enable-dual-proxy-deepep
No description provided.