
Add APIs for NIXL's UCCL backend #186


Open · wants to merge 29 commits into base: main

Conversation

praveingk (Collaborator)

@praveingk praveingk commented Jul 23, 2025

The following changes support the UCCL backend inside NIXL:

  1. Added WITH_PYTHON definitions to provide C++-only APIs.
  2. Added uccl_engine APIs used by NIXL's UCCL backend for integration (Uccl plugin praveingk/nixl-uccl#1). WRITE works with the nixl plugin; READ is not working yet (debugging in progress).
  3. Added DISABLE_CALL_ONCE_STATIC definitions in the rdma-transport code to provide a clean way to initialize/destroy/re-initialize the uccl plugin from another process (nixl). Additionally, exposed the socket fd used for coordination so the uccl backend for NIXL can use it for additional coordination.
  4. Extended benchmark_nixl.py to support the uccl backend.
  5. Bug fix: GPU bus IDs were not matching because of uppercase/lowercase differences, so a normalization step was added.

@praveingk praveingk marked this pull request as draft July 23, 2025 06:58
@praveingk praveingk marked this pull request as ready for review August 11, 2025 03:13
@YangZhou1997 (Member)

Great work! Will take a look very soon!

@@ -23,17 +23,21 @@ std::once_flag glog_init_once;
constexpr uint32_t kGpuStreamId = 0;

inline void check_python_signals() {
#ifdef WITH_PYTHON
Member

Hi @praveingk, is this WITH_PYTHON macro required, given that the NIXL Python layer has already handled it?

Collaborator Author

Hi @YangZhou1997, I started with the WITH_PYTHON macro for C++-specific consumers, although NIXL uses Python and still requires the Python signals to be handled. I noticed the `bool inside_python` parameter in `bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data, size_t size, bool inside_python)` and thought the macro could be a replacement for it. But I can remove this macro and keep the implementation as before to keep it simple.

Member

I see. I chose bool inside_python because we want to provide both sync and async send/recv APIs. When a Python application uses the sync APIs and calls send, inside_python must be true to check signals; when a Python application uses the async APIs, a C++ native thread calls send, and inside_python must be false to avoid the signal check (I find that a native thread checking Python signals is costly). To accommodate the two API styles in the same application, I chose a dynamic bool inside_python instead of a static macro. What do you think?

Another way is to use a C++ template, but that seems hard to make work if the send function is not implemented in the header file.

Collaborator Author

Makes sense, @YangZhou1997. Let me revert this file to its previous implementation.

DCHECK(size <= 0xffffffff) << "size must be less than 4GB";
[[maybe_unused]] auto _ =
Member

Can you keep this `[[maybe_unused]] auto _ = xxx`, since both Python threads and C++ native threads may call this function? For C++ native threads, `py::gil_scoped_release{}` and `check_python_signals()` are super slow, yet unnecessary.

@@ -317,7 +331,7 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
done[ureq_issued % kMaxInflightChunks] = false;
ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Member

Similarly, here. thx

@@ -340,11 +354,10 @@ bool Endpoint::send(uint64_t conn_id, uint64_t mr_id, void const* data,
return true;
}

bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
bool inside_python) {
[[maybe_unused]] auto _ =
Member

Similarly, here. thx

@@ -379,7 +392,7 @@ bool Endpoint::recv(uint64_t conn_id, uint64_t mr_id, void* data, size_t size,
done[ureq_issued % kMaxInflightChunks] = false;
ureq_issued++;
}
auto _ = inside_python ? (check_python_signals(), nullptr) : nullptr;
Member

Similarly, here. thx

@@ -687,14 +688,18 @@ void UcclRDMAEngine::handle_install_ctx_on_engine(Channel::CtrlMsg& ctrl_work) {
RDMAEndpoint::RDMAEndpoint(int num_engines_per_dev)
: num_engines_per_dev_(num_engines_per_dev),
stats_thread_([this]() { stats_thread_fn(); }) {
#ifndef DISABLE_CALL_ONCE_STATIC
Member

Similarly here

bool called = false;
#ifndef DISABLE_CALL_ONCE_STATIC
Member

Similarly here. I am worried that creating engines for each thread would create a huge number of engines in total.

@@ -954,9 +962,19 @@ void RDMAEndpoint::install_ctx_on_engines(int fd, int dev, PeerID peer_id,
DCHECK(factory_dev) << "install_ctx_on_engines: get_factory_dev()";

ret = send_message(fd, &factory_dev->gid.raw, 16);
printf("Sent factory dev raw: %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
Member

Can this use UCCL_LOG_EP instead of printf?

@@ -1159,7 +1180,26 @@ ConnID RDMAEndpoint::uccl_accept(int dev, int listen_fd, int local_gpuidx,
auto* factory_dev = RDMAFactory::get_factory_dev(dev);
DCHECK(factory_dev) << "uccl_accept: get_factory_dev()";

// Debug: Check if listen_fd is valid
Member

Wondering if we still need this?

@@ -1244,8 +1245,14 @@ class UcclFlow {
memset(&recv_comm_, 0, sizeof(recv_comm_));
int num_devices = ep->get_num_devices();
// Avoid all flows using the same initial engine offset.

Member

I think this always needs to be static; otherwise, a local std::vector&lt;std::atomic&lt;uint32_t&gt;&gt; has no effect on load balancing.
