Add APIs for NIXL's UCCL backend #186
base: main
Conversation
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Great work! Will take a look very soon!
@@ -23,17 +23,21 @@ std::once_flag glog_init_once;
constexpr uint32_t kGpuStreamId = 0;

inline void check_python_signals() {
#ifdef WITH_PYTHON
Hi @praveingk, is this `WITH_PYTHON` macro required, given that the NIXL Python layer has already handled it?
Hi @YangZhou1997, I started with the `WITH_PYTHON` macro for C++-specific consumers; although NIXL uses Python, it still requires the Python signals to be handled. I noticed the `bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data, size_t size, bool inside_python)` signature and thought this macro could be a replacement for that. But I can remove the macro and keep the implementation as it was, to keep things simple.
I see. I chose `bool inside_python` because we want to provide both sync and async send/recv APIs. When a Python application uses the sync APIs and calls `send`, `inside_python` must be true so that we check signals; when a Python application uses the async APIs, a C++ native thread calls `send`, and `inside_python` must be false to skip the signal check (I find that a native thread checking Python signals is costly). To accommodate the two APIs in the same application, I chose the dynamic bool `inside_python` instead of a static macro. What do you think?
Another way is to use a C++ template, but it seems hard to make that work if the `send` function is not implemented in the header file.
Makes sense, @YangZhou1997. Let me revert this file to its previous implementation.
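The trade-off discussed above (a dynamic `inside_python` flag versus a compile-time `WITH_PYTHON` macro) can be sketched as follows. This is illustrative, not the actual UCCL implementation: `check_python_signals` is stubbed to a call counter, whereas the real one would acquire the GIL and call `PyErr_CheckSignals()` via pybind11.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in: the real check_python_signals() acquires the GIL
// and calls PyErr_CheckSignals(); here we only count invocations.
static int signal_checks = 0;
inline void check_python_signals() { ++signal_checks; }

// One send() serves both Python callers (sync API, inside_python == true)
// and C++ native threads (async API, inside_python == false); the costly
// Python signal check is skipped for native threads.
bool send(uint64_t conn_id, uint64_t mr_id, void const* data, size_t size,
          bool inside_python) {
  if (inside_python) check_python_signals();
  // ... post the RDMA send and wait for completion here ...
  (void)conn_id; (void)mr_id; (void)data; (void)size;
  return true;
}
```

Because the flag is an ordinary runtime argument, both API styles can coexist in one process, which a single compile-time macro cannot express.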
DCHECK(size <= 0xffffffff) << "size must be less than 4GB";
[[maybe_unused]] auto _ =
Can you keep this `[[maybe_unused]] auto _ = xxx`, as both Python threads and C++ native threads may call this function? For C++ native threads, `py::gil_scoped_release{}` and `check_python_signals()` are super slow, yet unnecessary.
@@ -317,7 +331,7 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
  done[ureq_issued % kMaxInflightChunks] = false;
  ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Similarly, here. thx
@@ -340,11 +354,10 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
  return true;
}

bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
                    bool inside_python) {
  [[maybe_unused]] auto _ =
Similarly, here. thx
@@ -379,7 +392,7 @@ bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
  done[ureq_issued % kMaxInflightChunks] = false;
  ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Similarly, here. thx
@@ -687,14 +688,18 @@ void UcclRDMAEngine::handle_install_ctx_on_engine(Channel::CtrlMsg& ctrl_work) {
RDMAEndpoint::RDMAEndpoint(int num_engines_per_dev)
    : num_engines_per_dev_(num_engines_per_dev),
      stats_thread_([this]() { stats_thread_fn(); }) {
#ifndef DISABLE_CALL_ONCE_STATIC
Similarly here
bool called = false;
#ifndef DISABLE_CALL_ONCE_STATIC
Similarly here. I am worried that creating engines for each thread would create a huge number of engines in total.
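A minimal sketch of what a `DISABLE_CALL_ONCE_STATIC` switch trades off (names and structure are illustrative, not the actual UCCL code): a `std::once_flag` cannot be re-armed, so initialization guarded by `std::call_once` can never run again in the same process, which is a problem when a host such as NIXL wants to destroy and re-initialize the plugin.

```cpp
#include <mutex>

static int init_count = 0;  // counts how many times init actually ran

#ifndef DISABLE_CALL_ONCE_STATIC
// One-time initialization: std::once_flag cannot be reset, so a
// destroy/re-initialize cycle cannot re-run the body.
static std::once_flag init_once;
void initialize() { std::call_once(init_once, [] { ++init_count; }); }
void destroy() { /* nothing can re-arm init_once */ }
#else
// Resettable guard: destroy() re-arms initialization so the plugin can be
// torn down and brought back up by the host's lifecycle hooks. Note that,
// unlike std::call_once, a plain bool is not thread-safe by itself.
static bool initialized = false;
void initialize() { if (!initialized) { initialized = true; ++init_count; } }
void destroy() { initialized = false; }
#endif
```

With the default build (macro undefined), `initialize(); destroy(); initialize();` runs the body exactly once; with the macro defined, it runs twice.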
@@ -954,9 +962,19 @@ void RDMAEndpoint::install_ctx_on_engines(int fd, int dev, PeerID peer_id,
  DCHECK(factory_dev) << "install_ctx_on_engines: get_factory_dev()";

  ret = send_message(fd, &factory_dev->gid.raw, 16);
  printf("Sent factory dev raw: %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
Use `UCCL_LOG_EP` instead?
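One way to follow that suggestion: format the 16 raw GID bytes into a string once and hand it to the log macro instead of `printf`. The helper below is a sketch; the `UCCL_LOG_EP` usage in the trailing comment is assumed from the reviewer's hint, not taken from the codebase.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format a 16-byte GID as colon-separated hex, e.g. "fe:80:...:01".
std::string gid_to_hex(uint8_t const raw[16]) {
  char buf[3 * 16];  // 16 two-digit bytes + 15 colons + NUL fits in 48
  char* p = buf;
  for (int i = 0; i < 16; ++i)
    p += std::snprintf(p, 4, i ? ":%02x" : "%02x", raw[i]);
  return std::string(buf, p - buf);
}

// Assumed usage with the project's log macro instead of printf:
//   UCCL_LOG_EP << "Sent factory dev raw: "
//               << gid_to_hex(factory_dev->gid.raw);
```

This also keeps debug output under the project's logging verbosity controls rather than unconditionally writing to stdout.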
@@ -1159,7 +1180,26 @@ ConnID RDMAEndpoint::uccl_accept(int dev, int listen_fd, int local_gpuidx,
  auto* factory_dev = RDMAFactory::get_factory_dev(dev);
  DCHECK(factory_dev) << "uccl_accept: get_factory_dev()";

  // Debug: Check if listen_fd is valid
Wondering if we still need this?
@@ -1244,8 +1245,14 @@ class UcclFlow {
  memset(&recv_comm_, 0, sizeof(recv_comm_));
  int num_devices = ep->get_num_devices();
  // Avoid all flows using the same initial engine offset.
|
I think this always needs to be static; otherwise, a local `std::vector<std::atomic<uint32_t>>` has no impact on load balancing.
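To illustrate the reviewer's point under assumed names (this is not the `UcclFlow` code itself): only a `static` counter is shared across flows. A function-local, non-static vector would be re-created, freshly zeroed, for every flow, so each flow would start from the same engine offset and round-robin balancing would never happen.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

constexpr int kMaxDevices = 8;  // assumed bound, for illustration only

// Round-robin choice of an initial engine offset per device. The counter
// vector is static so that all flows in the process share it; without
// `static`, every call would see a zeroed vector and return offset 0.
uint32_t next_engine_offset(int dev, uint32_t num_engines) {
  static std::vector<std::atomic<uint32_t>> counters(kMaxDevices);
  return counters[dev].fetch_add(1) % num_engines;
}
```

`std::atomic` makes concurrent flow construction safe without a lock, and the vector is sized once at first use since `std::atomic` is neither copyable nor movable.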
The changes are the following, to support the UCCL backend inside NIXL:

- `WITH_PYTHON` definitions to provide C++-only APIs.
- `uccl_engine` APIs, used by NIXL's UCCL backend for integration (Uccl plugin praveingk/nixl-uccl#1). WRITE works with the nixl plugin; READ is still not working (debugging in progress).
- `DISABLE_CALL_ONCE_STATIC` definitions in the rdma-transport code to provide a clean way to initialize/destroy/re-initialize the uccl plugin from another process (nixl). Additionally, the socket fd used for coordination is exposed so the UCCL backend for NIXL can use it for further coordination.
- `benchmark_nixl.py` updated to support the uccl backend.