
Add APIs for NIXL's UCCL backend #186


Open · wants to merge 29 commits into base: main

Conversation

praveingk (Collaborator)

@praveingk praveingk commented Jul 23, 2025

The following changes support the UCCL backend inside NIXL:

  1. Added WITH_PYTHON definitions to provide C++-only APIs.
  2. Added uccl_engine APIs used by NIXL's UCCL backend for integration (Uccl plugin praveingk/nixl-uccl#1). WRITE works with the nixl plugin; READ is not working yet (debugging in progress).
  3. Added DISABLE_CALL_ONCE_STATIC definitions in the rdma-transport code to provide a clean way to initialize/destroy/re-initialize the uccl plugin from another process (nixl). Additionally, exposed the socket fd used for coordination so the uccl backend for NIXL can use it for additional coordination.
  4. Extended benchmark_nixl.py to support the uccl backend.
  5. Bug fix: GPU bus IDs were not matching because of uppercase/lowercase differences, so a normalization step was added.

@praveingk praveingk marked this pull request as draft July 23, 2025 06:58
@praveingk praveingk marked this pull request as ready for review August 11, 2025 03:13
@YangZhou1997 (Member)

Great work! Will take a look very soon!

@@ -23,17 +23,21 @@ std::once_flag glog_init_once;
constexpr uint32_t kGpuStreamId = 0;

inline void check_python_signals() {
#ifdef WITH_PYTHON
Member

Hi @praveingk, is this WITH_PYTHON macro required, given that the NIXL Python layer has already handled it?

Collaborator Author

Hi @YangZhou1997, I started with the WITH_PYTHON macro for C++-specific consumers, although NIXL uses Python and still requires the Python signals to be handled. I noticed the `bool inside_python` parameter in `bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data, size_t size, bool inside_python)` and thought the macro could be a replacement for it. But I can remove this macro and keep the implementation as before to keep it simple.

Member

I see. I chose bool inside_python because we want to provide both sync and async send/recv APIs. When a Python application uses the sync APIs and calls send, inside_python must be true to check signals; when a Python application uses the async APIs, a C++ native thread calls send, and inside_python must be false to avoid the signal check (I find that a native thread checking Python signals is costly). To accommodate the two API styles in the same application, I chose a dynamic bool inside_python instead of a static macro. What do you think?

Another way is to use a C++ template, but that seems hard to make work if the send function is not implemented in the header file.

Collaborator Author

Makes sense, @YangZhou1997. Let me revert this file to its previous implementation.

DCHECK(size <= 0xffffffff) << "size must be less than 4GB";
[[maybe_unused]] auto _ =
Member

Can you keep this `[[maybe_unused]] auto _ = xxx`, since both Python threads and C++ native threads may call this function? For C++ native threads, `py::gil_scoped_release{}` and `check_python_signals()` are super slow, yet unnecessary.

@@ -317,7 +331,7 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
done[ureq_issued % kMaxInflightChunks] = false;
ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Member

Similarly, here. thx

@@ -340,11 +354,10 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
return true;
}

bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
bool inside_python) {
[[maybe_unused]] auto _ =
Member

Similarly, here. thx

@@ -379,7 +392,7 @@ bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
done[ureq_issued % kMaxInflightChunks] = false;
ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Member

Similarly, here. thx

@@ -687,14 +688,18 @@ void UcclRDMAEngine::handle_install_ctx_on_engine(Channel::CtrlMsg& ctrl_work) {
RDMAEndpoint::RDMAEndpoint(int num_engines_per_dev)
: num_engines_per_dev_(num_engines_per_dev),
stats_thread_([this]() { stats_thread_fn(); }) {
#ifndef DISABLE_CALL_ONCE_STATIC
Member

Similarly here

bool called = false;
#ifndef DISABLE_CALL_ONCE_STATIC
Member

Similarly here. I am worried that creating engines for each thread would create a huge number of engines in total.

@@ -954,9 +962,19 @@ void RDMAEndpoint::install_ctx_on_engines(int fd, int dev, PeerID peer_id,
DCHECK(factory_dev) << "install_ctx_on_engines: get_factory_dev()";

ret = send_message(fd, &factory_dev->gid.raw, 16);
printf("Sent factory dev raw: %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
Member

Can this use UCCL_LOG_EP instead of printf?

@@ -1159,7 +1180,26 @@ ConnID RDMAEndpoint::uccl_accept(int dev, int listen_fd, int local_gpuidx,
auto* factory_dev = RDMAFactory::get_factory_dev(dev);
DCHECK(factory_dev) << "uccl_accept: get_factory_dev()";

// Debug: Check if listen_fd is valid
Member

Wondering if we still need this?

@@ -1244,8 +1245,14 @@ class UcclFlow {
memset(&recv_comm_, 0, sizeof(recv_comm_));
int num_devices = ep->get_num_devices();
// Avoid all flows using the same initial engine offset.

Member

I think this always needs to be static; otherwise, a local std::vector&lt;std::atomic&lt;uint32_t&gt;&gt; has no effect on load balancing.
