# Multi-GPU Communication Modes: NCCL, NVSHMEM, MPI, and LLM Training


## Four Main Communication Modes
   - Two-sided message passing: send / recv
   - One-sided remote memory access: put / get
   - Collective communication: all-reduce / all-gather / reduce-scatter / all-to-all
   - Direct remote pointer / load-store: remote GPU memory as pointer

## Setup topics
- Common multi-machine GPU communication: driver / CUDA / NCCL / NIC / GPUDirect RDMA
   - Multi-Node NVLink: MNNVL/IMEX domain/NCCL_MNNVL_ENABLE/CUMEM
   - Verification path:`nvidia-smi topo`, IMEX channel, NCCL tests, NCCL debug log

## Practical guide topics
- Normal writing order of host-side NCCL
   - Bandwidth sanity check and debug sequence after communicator initialization
   - DP all-reduce writing skeleton of NCCL Device API
   - Differences in responsibilities between`cudaSetDevice`, CUDA stream, and NCCL communicator
   - Code skeleton for single-process single-GPU/rank and single-process multi-GPU
   - How to create group-specific communicator when DP/TP mixed parallelism


## different layers 

PyTorch / MPI / OpenSHMEM
        |
        |  high-level API
        v
NCCL / UCC / MPI collectives
        |
        |  collective / communication runtime
        v
UCX / libfabric
        |
        |  transport / network abstraction
        v
InfiniBand / RoCE / Ethernet / EFA / shared memory / CUDA IPC / RDMA

## Summarize

The core issues of Multi-GPU communication are:

```text
每张 GPU 只能直接访问自己的 memory，
但训练或 HPC 程序需要很多 GPU 协同完成一个全局数学操作。
```

The communication system must ensure:

```text
数据在正确时间出现在正确 GPU 上，
并且不要让 GPU 大量 idle。
```

This is difficult not only because of limited bandwidth, but also because there are many control issues in communication:

- Who communicates with whom?
- How do send and recv match?
- Is the remote buffer ready?
- When will the data be ready?
- Is there background progress for asynchronous communication?
- When is it safe to read and write remote memory?
- Can communication overlap with GPU compute?
- Are temporary buffers, staging buffers, and registered memory needed?
- Which topology should be NVLink, NVSwitch, PCIe or InfiniBand/RDMA?

The core judgment of this session is:

```text
AI/HPC 的重要通信模式通常是 persistent 的。
也就是说，每个 iteration 都重复类似的通信关系、tensor shape 和 rank group。
```

So the optimization direction is:

```text
提前 setup
提前安排通信 schedule
减少每轮 send/recv rendezvous
把纯数据传输交给 NVLink/RDMA/copy engine/NCCL kernels
避免 CPU、GPU block/thread、message matching 成为 bottleneck
```

## Mode 1: Two-Sided Message Passing

### problems it solves

Two-sided communication is the most intuitive communication model:

```text
sender 调用 send
receiver 调用 recv
两边匹配成功后，数据从 sender buffer 进入 receiver buffer
```

It solves the problem of point-to-point data exchange:

```text
GPU/process A 有数据
GPU/process B 需要这份数据
```

### Implement logic

The key points of two-sided communication are: **The sender declares "Who do I want to send this data to", and the receiving end declares "Whose data do I plan to receive into which buffer?" The real matching occurs in the communication runtime, which is usually mainly carried by the data structure of the receiving end. **

The receiving end usually maintains two queues:

```text
posted receive queue:
  已经提交的 recv

unexpected message queue:
  已经到达但还没有匹配到 recv 的消息
```

A send/recv needs to match metadata:

```text
source rank
destination rank
communicator
tag
datatype/count
buffer address
message size
```

The sender and receiver know different information:

```text
发送端知道:
  send buffer
  count / datatype
  destination rank
  tag
  communicator

接收端知道:
  receive buffer
  count / datatype
  source rank 或 ANY_SOURCE
  tag
  communicator
```

The receiving end does not know where the send buffer is; the sending end does not know where the receive buffer is. Both sides are aligned through metadata such as`rank + tag + communicator + datatype/count`.

#### How does the receiving end work?

Receiver call:

```c
MPI_Recv(buf, count, type, src, tag, comm, &status);
```

Or non-blocking version:

```c
MPI_Irecv(buf, count, type, src, tag, comm, &req);
```

This tells the communication library:

```text
我准备接收一个消息；
如果它来自 src，tag 属于 tag，communicator 是 comm；
就把数据写入 buf。
```

The logic on the receiving end is usually:

```text
recv called
  -> search unexpected message queue
      -> match found:
           copy/RDMA data into user recv buffer
           mark recv complete
      -> no match:
           put this recv request into posted receive queue
           wait for future incoming message
```

If receive has been posted in advance:

```text
send arrives -> match recv -> copy/RDMA into receiver buffer
```

This is the smoothest situation because the data can go directly into the user-provided receive buffer.

#### How does the sender work?

Sender call:

```c
MPI_Send(sendbuf, count, type, dst, tag, comm);
```

Or non-blocking version:

```c
MPI_Isend(sendbuf, count, type, dst, tag, comm, &req);
```

This tells the communication library:

```text
我要把 sendbuf 里的 count 个 type 发给 dst；
这条消息的 tag 是 tag，communicator 是 comm。
```

The logic on the sending side is usually:

```text
send called
  -> build message header
       source rank
       destination rank
       tag
       communicator
       datatype/count/size
  -> choose protocol
       eager for small messages
       rendezvous for large messages
  -> send header and possibly payload
  -> complete immediately, wait, or return a request depending on API/protocol
```

When Blocking send returns, it usually means that`sendbuf`can be safely reused, but it does not necessarily mean that the remote application has finished processing the data. After Non-blocking send returns,`sendbuf`cannot be changed and must wait for`MPI_Wait`/ request completion.

#### What happens when send comes first?

If send arrives first but recv is not ready, small messages usually go through the eager protocol:

```text
sender:
  send header + payload

receiver:
  message arrives
  no matching recv
  -> store message in unexpected message queue

later:
  recv called
  -> search unexpected queue
  -> match found
  -> copy payload into user recv buffer
```

This requires the receiving end to have a temporary buffer to store unexpected payload, so eager is suitable for small messages.

Big news usually goes through the rendezvous protocol:

```text
sender:
  send RTS: request to send

receiver:
  RTS arrives
  no matching recv
  -> store metadata only, not the full payload

later:
  recv called
  -> match RTS
  -> receiver now has a real recv buffer
  -> send CTS: clear to send / permission to send

sender:
  transfer real payload

receiver:
  data lands in recv buffer
  -> recv complete
```

This is the origin of the "handshake" in two-sided communication: the sender and receiver must first be aligned through metadata, and then the data can safely enter the correct receive buffer.

### handshake and agreement

There are two common types of agreements:

```text
eager protocol:
  小消息直接发送，receiver 还没 post recv 时可能先放到 unexpected buffer。

rendezvous protocol:
  大消息先发 request-to-send，
  receiver 准备好后返回 permission-to-send，
  然后再传真正的数据。
```

The advantage of Rendezvous is that it avoids wasting temporary buffers for large messages; the disadvantage is that there are more handshakes and delays.

### sync and progress

#### Blocking

```c
MPI_Send(...);
MPI_Recv(...);
```

The characteristic of Blocking is that the calling thread will stop in the communication API until the operation meets the return conditions.

```text
blocking send:
  返回时，sendbuf 通常可以安全复用。
  但不一定表示远端应用已经处理完数据。

blocking recv:
  返回时，recvbuf 已经有有效数据。
```

advantage:

```text
语义简单
容易写对
调试和 benchmark 方便
```

shortcoming:

```text
CPU thread 会卡住
GPU compute 可能因为等待通信而 idle
不利于 overlap compute and communication
```

There is also a similar blocking host wait in GPU/NCCL:

```c
ncclAllReduce(..., comm_stream);
cudaStreamSynchronize(comm_stream);
optimizer_kernel<<<..., compute_stream>>>(grad);
```

Here`ncclAllReduce`just enqueues to`comm_stream`, and`cudaStreamSynchronize(comm_stream)`will make the CPU wait until the task before`comm_stream`is completed. It can ensure that the subsequent optimizer will not read`grad`too early, but the cost is that the CPU is blocked and cannot continue to enqueue subsequent GPU work in advance.

#### Non-blocking

```c
MPI_Irecv(..., &req_recv);
MPI_Isend(..., &req_send);
do_compute();
MPI_Waitall(...);
```

The core of Non-blocking is to separate "submit communication request" and "wait for communication to complete":

```text
1. 先 post request
2. API 返回 request handle
3. 程序继续做不依赖通信结果的 compute
4. 真正需要结果前再检查或等待 completion
```

But non-blocking only means that the API returns first, which does not mean that the data must automatically advance in the background. You also need to care about the progress mechanism.

Common completion/progress methods:

```text
MPI_Test / MPI_Testsome:
  非阻塞检查 request 是否完成。
  程序自己周期性调用，属于 polling/query 风格。

MPI_Wait / MPI_Waitall / MPI_Waitany / MPI_Waitsome:
  阻塞等待 completion。
  应用线程不再继续 compute，而是把控制权交给 MPI runtime 等待/推进通信。

background progress thread:
  runtime 开后台线程推进通信。
  主线程 Isend/Irecv 后继续 compute，后台线程处理网络 progress 和 completion。

NIC / RDMA hardware progress:
  CPU/runtime 提交 work request 后，NIC/RDMA engine 搬数据。
  数据路径可以由硬件推进，但 runtime 通常仍要处理 completion queue。

CUDA stream / event:
  GPU communication 例如 NCCL enqueue 到 CUDA stream。
  用 cudaEventRecord / cudaStreamWaitEvent 表达依赖，避免 CPU 手动轮询。

framework-level Work handle:
  例如 PyTorch dist.all_reduce(..., async_op=True) 返回 work。
  最后 work.wait()，底层再映射到 NCCL/Gloo/MPI/UCC 的 completion 机制。
```

The completion semantics of send/recv requests are different:

```text
send request complete:
  sendbuf 可以安全复用或修改。
  不一定表示远端应用已经处理完数据。

recv request complete:
  recvbuf 已经有有效数据。
  可以读取。
```

So the practical pattern is usually:

```text
先 Irecv
再 Isend
中间做不依赖 recvbuf 的 compute
最后 Wait/Test completion
```

If there is no overlapping compute in the middle, or the runtime/hardware does not really advance the communication, then non-blocking just postpones the waiting to the later`Wait`.

#### Key differences between CUDA streams/events

Both`cudaStreamSynchronize`and`cudaEventRecord + cudaStreamWaitEvent`can achieve the correctness goal of "using the result after communication is complete", but they are not equivalent operations.

`cudaStreamSynchronize`is a CPU-side wait:

```c
ncclAllReduce(..., comm_stream);
cudaStreamSynchronize(comm_stream);
optimizer_kernel<<<..., compute_stream>>>(grad);
```

The logic is:

```text
CPU:
  enqueue allreduce
  block on cudaStreamSynchronize
  wait until comm_stream completes
  then enqueue optimizer

comm_stream:
  allreduce

compute_stream:
  optimizer starts only after CPU unblocks and launches it
```

`cudaEventRecord + cudaStreamWaitEvent`is a GPU-side stream dependency:

```c
ncclAllReduce(..., comm_stream);
cudaEventRecord(comm_done, comm_stream);
cudaStreamWaitEvent(compute_stream, comm_done, 0);
optimizer_kernel<<<..., compute_stream>>>(grad);
```

The logic is:

```text
CPU:
  enqueue allreduce
  enqueue comm_done event
  enqueue compute_stream wait
  enqueue optimizer
  continue doing other work

comm_stream:
  allreduce -> record comm_done

compute_stream:
  wait comm_done -> optimizer
```

So the difference between the two is:

```text
cudaStreamSynchronize:
  CPU 等 GPU stream 完成。
  简单，但会阻塞 host thread。

cudaEventRecord + cudaStreamWaitEvent:
  一个 GPU stream 等另一个 GPU stream 的完成点。
  CPU 不阻塞，可以继续 enqueue work。
```

Training frameworks often prefer event/stream dependencies because they preserve asynchronous execution and overlap.`cudaStreamSynchronize`is more suitable for debugging, simple programs, or where host-side completion is explicitly required.

### Memory usage

May require:

- sender buffer
- receiver buffer
- unexpected message buffer
- temporary staging buffer
- pinned host memory
- GPU memory registration cache
- protocol metadata queues

Without GPUDirect RDMA, the path might become:

```text
GPU memory -> CPU pinned memory -> NIC -> network -> CPU pinned memory -> GPU memory
```

The ideal path with GPUDirect RDMA is:

```text
GPU memory -> NIC/RDMA -> remote GPU memory
```

### Main questions

The problem with Two-sided communication is not just "data transfer is slow". More precisely, it encounters bandwidth, latency, synchronization, progress, metadata, topology, and memory management issues simultaneously.

#### 1. Synchronization / ordering / progress issues

`send/recv`misalignment falls into this category. It is not a pure bandwidth issue, but a two-sided message matching and synchronization timing issue.

```text
send 已经到了，但 recv 还没 post
  -> send 可能等，或者消息进入 unexpected queue

recv 已经 post，但 send 还没 ready
  -> recv 等
```

In blocking mode, the performance is:

```text
MPI_Send / MPI_Recv 卡住
```

In non-blocking mode, the behavior is:

```text
MPI_Isend / MPI_Irecv 先返回
但最后 MPI_Wait / stream sync / collective completion 处等待
```

So async does not eliminate misalignment, but gives compute/communication overlap a chance. When you can't hide it, you will still be exposed by waiting.

Solution:

```text
先 post receive，再 send:
  MPI_Irecv(...)
  MPI_Isend(...)
  do_compute()
  MPI_Waitall(...)

保证所有 rank 的通信顺序一致:
  rank 0: op A -> op B
  rank 1: op A -> op B

用 batch/group 提交 P2P:
  NCCL ncclGroupStart / ncclGroupEnd
  PyTorch batch_isend_irecv

用 CUDA stream/event 表达依赖:
  data_ready 后再启动通信
  comm_done 后再使用结果

减少 rank arrival skew:
  平衡 batch、tokens、pipeline stage、expert load
```

#### 2. Message matching / metadata overhead

Two-sided send/recv needs to match:

```text
source rank
destination rank
communicator
tag
datatype/count
message size
buffer address
```

This introduces runtime complexity. Especially:

```text
ANY_SOURCE / wildcard
复杂 MPI datatype
tags
多个 rank 共享同一 GPU
```

These semantics are very general, but they are difficult to implement into hardware efficiently, and they are not suitable for GPU threads/blocks to do complex matching on their own.

Solution:

```text
避免 wildcard:
  尽量用明确 source rank，不用 ANY_SOURCE。

简化 message shape:
  尽量使用 contiguous buffer，避免复杂 datatype。
  所以在 GPU 通信里，通常宁愿多做一个高效 pack/unpack kernel，也不想让通信层处理复杂、零散的数据形状。“简化 message shape”就是把通信数据整理成连续、规则、统一 datatype 的 buffer，让通信库只负责高效搬一整块 bytes，而不是处理复杂的不连续结构。

减少 tag/matching 复杂度:
  固定通信顺序和 tag 规则。

能用 collective 就用 collective:
  all-reduce / all-gather / reduce-scatter / all-to-all
  让 NCCL/MPI collective runtime 处理 schedule。

persistent pattern 提前 setup:
  communicator、peer list、buffer layout、message order 提前固定。
```

#### 3. Progress question

Non-blocking just returns the API first and does not guarantee that the communication will actually advance in the background.

Requires dependencies:

```text
MPI_Test / MPI_Wait 触发 progress
background progress thread
NIC / RDMA hardware progress
CUDA stream execution
framework-level Work handle
```

If there is no real progress, even if compute is performed in the middle, the communication may not advance until`Wait`, and the overlap will fail.

Solution:

```text
确认 runtime progress 机制:
  是否有 progress thread？
  MPI_Test / MPI_Wait 是否会推进？
  NIC/RDMA 是否能 offload 数据路径？

在 CPU 侧:
  用 MPI_Test/MPI_Testsome 周期性推进，而不是完全不碰通信。
  或启用 MPI/UCX/vendor runtime 的 async progress 配置。

在 GPU/NCCL 侧:
  把通信 enqueue 到 CUDA stream。
  用 cudaEventRecord / cudaStreamWaitEvent 建依赖。

在框架侧:
  PyTorch async_op=True 后保留 Work handle。
  真正使用结果前 work.wait()。
```

#### 4. Rendezvous / setup duplication overhead

Big news usually goes rendezvous:

```text
RTS: request to send
CTS: clear to send / permission to send
real payload transfer
```

If each iteration repeats the same communication pattern, but re-does send/recv rendezvous every time, the opportunity for persistent communication is wasted.

This is also the point emphasized in session:

```text
AI/HPC 通信 pattern 通常是 persistent 的；
应该尽量提前 setup，amortize 掉 setup 和 matching 成本。
```

Solution:

```text
把重复通信 pattern 固定下来:
  rank group
  peer order
  buffer layout
  tensor shape
  tag/order

提前建立通信对象:
  NCCL communicator
  MPI communicator
  NVSHMEM symmetric heap
  registered memory / memory registration cache

减少每轮动态 rendezvous:
  用 collective 代替手写 send/recv。
  用 one-sided put/get 或 direct pointer 代替 repeated matching。
  对 repeated P2P 使用固定 schedule。

把 setup 成本摊到很多 iteration:
  setup once
  run many steps
```

#### 5. Bandwidth and latency issues

The bandwidth issues are:

```text
数据量太大，链路搬不动
```

The latency issues are:

```text
消息很多很小，每条消息都有启动成本
```

Common levels in GPU communication are:

```text
HBM > NVLink/NVSwitch > PCIe > InfiniBand/RDMA > Ethernet
```

Two-sided small news may also bear additional:

```text
message header
matching
queue operation
RTS/CTS handshake
completion processing
```

Solution:

```text
提高带宽利用:
  合并小消息成大 bucket。
  使用 contiguous/coalesced buffer。
  让 NCCL 选择合适 ring/tree/protocol。

降低 latency 影响:
  batch/group 多个通信操作。
  减少 kernel launch 和 API launch 次数。
  小消息尽量融合进更大的 collective 或 fused kernel。

选择正确路径:
  同机优先 NVLink/NVSwitch。
  跨机尽量启用 GPUDirect RDMA。
  避免不必要的 host staging。

overlap compute and communication:
  gradient bucket ready 后尽早通信。
  后续不依赖结果的 compute 继续跑。
```

#### 6. Topology and hot issues

The paths are different between different GPUs/nodes:

```text
同机: NVLink / NVSwitch / PCIe
跨机: NIC / InfiniBand / RDMA / Ethernet / EFA
```

If the communication schedule does not consider topology, the following may occur:

```text
某些链路拥塞
某些 rank 成为 straggler
all-to-all 或 MoE token dispatch 产生热点
```

Solution:

```text
让通信库感知 topology:
  使用 NCCL communicator/topology detection。
  不要轻易手写错误的通信路径。

优化 all-to-all schedule:
  circulant shift
  staggered schedule
  multi-stream
  auto-tuning，例如 cuDecomp 这类思路

处理 load imbalance:
  MoE 做 expert load balancing。
  pipeline parallel 平衡 stage 时间。
  data/tensor/context parallel 切分要均匀。

减少热点:
  token dispatch 做 capacity/balancing。
  避免所有 rank 同时打同一链路或同一目标。
```

Among them,`multi-stream`means: use multiple CUDA streams / NCCL streams to submit multiple groups of communications concurrently, instead of stringing all P2P operations in one stream.

A single stream may be:

```text
comm_stream:
  send block 0 -> send block 1 -> send block 2 -> send block 3
```

multi-stream can be:

```text
comm_stream_0:
  send block 0

comm_stream_1:
  send block 1

comm_stream_2:
  send block 2
```

A stream may not be full of bandwidth, but it is often encountered in P2P / all-to-all / small-block communication:

```text
单队列顺序执行，独立通信被串行化
单个 P2P pair 暴露的并发度不够
in-flight bytes 不够，链路没有持续吃满
小消息 latency / protocol / completion 开销占比高
NCCL send/recv 可能对单个 pair 做保守 throttling
```

This sentence on pages 51-52 of Hammond slides corresponds to this phenomenon:

```text
NCCL send-recv assume collective usage patterns
and throttle themselves, hence a single pair of
send-recv may not saturate the NVL.

Multiple streams should allow me to override the
throttling behavior...
```

What this means is: when you use`ncclGroupStart()`/`ncclGroupEnd()`to wrap many`ncclSend`/`ncclRecv`, NCCL does not just treat each P2P as a completely independent message, but tends to treat this set of operations as an overall communication graph.

For example:

```c
ncclGroupStart();

for (int peer = 0; peer < nranks; peer++) {
    ncclSend(sendbuf[peer], count, type, peer, comm, stream);
    ncclRecv(recvbuf[peer], count, type, peer, comm, stream);
}

ncclGroupEnd();
```

On the surface this is a lot of P2P:

```text
send to rank 0
recv from rank 0
send to rank 1
recv from rank 1
...
```

But the overall pattern is very collective, such as all-to-all:

```text
这一轮所有 rank 都和所有 rank 交换一块数据。
```

So NCCL will focus more on overall scheduling rather than letting a single send/recv pair monopolize the link:

```text
控制同时 in-flight 的量
避免把所有消息一次性冲进网络队列
避免某个 pair 抢占 NVLink/NVSwitch/NIC 资源
让整体通信图更均衡
```

This is what`collective-like pattern`means: **API is P2P, but this group of P2P semantically expresses a collective-style communication graph. **

Different from the common behavior of MPI: there are many`MPI_Isend`/`MPI_Irecv`written in MPI, the runtime usually does not automatically assume that you are writing all-to-all, nor does it automatically apply collective flow control. For collective scheduling, collective APIs such as`MPI_Alltoall`are usually called directly.

The role of multi-stream is to increase concurrency and the amount of data in transit:

```text
更多 independent operations 同时暴露给 runtime
更多 chunks 同时 in flight
更容易使用多条 NVLink lanes / NIC queues / internal channels
减少单个 stream 上的 gap
```

But multi-stream is not that more is better. Too many streams will bring:

```text
调度开销
更复杂的 ordering 和 event dependency
链路拥塞
cache / SM / copy engine / NIC queue 竞争
和 NCCL 自己的调度策略互相干扰
```

So the practical approach is auto-tune:

```text
测 nStreams = 1, 2, 4, 8 ...
计算 achieved bandwidth = transferred bytes / elapsed time
如果增加 stream 后带宽不再上涨，说明进入 plateau，基本接近瓶颈
```

#### 7. Memory usage and staging issues

Two-sided runtime may require:

```text
unexpected message buffer
temporary staging buffer
pinned host buffer
GPU memory registration cache
protocol metadata queues
```

Without GPUDirect RDMA, the path may degenerate to GPU to CPU pinned memory to NIC, increasing copying and memory usage.

Solution:

```text
减少 temporary buffer:
  用 in-place collective。
  用 reduce-scatter 代替 all-reduce + slice。
  用 direct remote pointer 避免 get 到临时 T。

减少 host staging:
  使用 CUDA-aware MPI / NCCL / GPUDirect RDMA。
  确认 GPU-NIC 拓扑支持直接路径。

重用 buffer:
  gradient buckets
  communication workspace
  memory pool
  registration cache

提前注册/分配:
  pinned host memory
  registered GPU memory
  NVSHMEM symmetric heap
```

#### 8. Too much CPU intervention

If every communication requires CPU involvement:

```text
发起 send/recv
处理 metadata
做 message matching
poll completion
launch 后续 kernel
```

The GPU may be waiting for host/network events, causing the compute pipeline to be interrupted. This is why NCCL/NVSHMEM/stream/event/device-side communication are all trying to reduce CPU intervention on the hot path.

Solution:

```text
减少 host-side wait:
  少用 cudaStreamSynchronize 做中间同步。
  用 cudaEventRecord / cudaStreamWaitEvent 建 GPU-side dependency。

减少 CPU 发起次数:
  batch/group communication。
  CUDA Graphs 或框架级 graph capture。
  bulk kernel 代替大量小 kernel launch。

把通信放进 GPU execution model:
  NCCL enqueue 到 CUDA stream。
  NVSHMEM device-side put/get。
  fused compute-communication kernel。

让 runtime/hardware 处理数据路径:
  NCCL collective schedule
  GPUDirect RDMA
  NIC offload
```

## Mode 2: One-Sided Remote Memory Access

### problems it solves

One-sided communication separates synchronization and data movement:

```text
put: initiator 把本地数据写到远端 memory
get: initiator 从远端 memory 读到本地
```

The remote end does not need to call`recv`at the same time.

What it solves is:

```text
我知道远端数据/目标 buffer 在哪里，
我想直接读或写，
不想每次都做 send/recv matching。
```

### Existing tools

- OpenSHMEM
- NVSHMEM
- MPI RMA: `MPI_Put`, `MPI_Get`, windows
- RDMA verbs / libibverbs
- UCX RMA
- CUDA-aware communication libraries using GPUDirect RDMA

### Implement logic

One-sided usually requires setup in advance:

```text
1. 分配或注册远端可访问 memory
2. 交换地址、offset、remote key、PE/rank 信息
3. 建立网络或 NVLink 可达路径
4. 后续 put/get 直接搬数据
```

A typical model of NVSHMEM is symmetric heap:

```text
每个 PE/GPU 都贡献一块 symmetric memory
相同 allocation 在不同 PE 上有对应关系
```

Logically:

```c
ptr = nvshmem_malloc(size);
nvshmem_putmem(remote_ptr, local_ptr, size, target_pe);
nvshmem_getmem(local_ptr, remote_ptr, size, target_pe);
```

### Handshake and synchronization

One-sided handshakes mainly occur in advance:

```text
memory registration
address/rkey exchange
heap setup
connection setup
```

When actually putting/getting, the other party does not need to post recv at the same time.

But the synchronization does not disappear, it is just transferred to the program or library:

```text
什么时候可以写远端 buffer？
什么时候远端可以读我写的数据？
什么时候本地 get 的结果可用？
```

Commonly used synchronization tools:

- barrier
- fence
- quiet
- signal / wait
- atomic
- stream/event dependency

### communication path

The same NVLink/NVSwitch domain:

```text
GPU load/store or NVLink transaction -> remote GPU memory
```

Across nodes:

```text
GPU memory -> NIC/RDMA -> network -> remote GPU memory
```

If the hardware or runtime does not support direct paths, there may be a fallback to host involvement.

### Memory usage

Usually requires:

- symmetric heap or registered buffer
- remote address/offset metadata
- remote keys/registration metadata
- completion/signal variables
- Sometimes temporary local buffer is needed

Compared with two-sided, it can reduce unexpected message buffer and per-message matching overhead.

### Main questions

- Programmers must ensure correct synchronization themselves
- Reading and writing races are more likely to occur
- Remote completion and local completion are not the same thing
- small put/get may be inefficient and requires coalescing
- thread-level put/get may generate many small messages
- The performance and semantics of put/get across networks are more complex than direct access within NVLink

## Mode 3: Collective Communication

### problems it solves

Collective is a group of GPUs that participate in communication operations together.

The most common ones in LLM training are:

```text
all-reduce
reduce-scatter
all-gather
broadcast
all-to-all
```

It solves the global collaboration problem rather than the message problem of a single pair.

example:

```text
Data parallel:
  所有 GPU 算出各自 gradients
  需要 all-reduce 得到 global gradient

Tensor parallel:
  每个 GPU 只有部分 activation 或 weight shard
  需要 all-gather / all-reduce / reduce-scatter

MoE:
  token 根据 expert routing 发到不同 GPU
  需要 all-to-all
```

### Existing tools

- NCCL: the main force of GPU collective
- MPI collectives:`MPI_Allreduce`,`MPI_Alltoall`, etc.
- RCCL: AMD GPU collective library
- oneCCL: Intel ecosystem
- PyTorch distributed:`torch.distributed`- DeepSpeed, Megatron-LM, Transformer Engine, FSDP, DTensor
- cuDecomp: domain decomposition / all-to-all scheduling

### Implement logic

The core of Collective is:

```text
所有 rank 在同一个 communicator 中，
按相同顺序调用同一个 collective。
```

It is not required that all ranks enter in the same nanosecond, but the order must be consistent.

NCCL is usually mastered during the communicator/setup phase:

```text
rank list
GPU topology
NVLink/NVSwitch/PCIe/IB path
message size
algorithm choice
protocol choice
stream
```

Then the data is cut into chunks and scheduled through ring/tree/pipeline.

Ring all-reduce intuition:

```text
每个 GPU 把一部分数据传给下一个 GPU
边传边 reduce
多轮之后每个 GPU 都拿到最终结果
```

All-to-all intuition:

```text
每个 GPU 都给每个其他 GPU 发一块数据
```

### Handshake and synchronization

Collective's handshake is not matched individually for each message, but is implicit through the collective call itself:

```text
所有 rank 都同意：现在做这个 collective。
```

If the order is wrong:

```text
rank 0: allReduce(A), allReduce(B)
rank 1: allReduce(B), allReduce(A)
```

Maybe hang.

NCCL operations typically enqueue to CUDA streams:

```c
ncclAllReduce(sendbuf, recvbuf, count, type, ncclSum, comm, stream);
```

Returning from the call does not mean completion, it only means the operation has entered the stream. The real completion depends on stream/event.

### communication path

Single machine with multiple GPUs:

```text
HBM -> NVLink/NVSwitch/PCIe -> remote HBM
```

Multiple machines and multiple GPUs:

```text
HBM -> NIC/RDMA -> InfiniBand/Ethernet/EFA -> remote NIC -> remote HBM
```

NCCL will try to select paths and algorithms based on topology.

### Memory usage

May require:

- send buffer
- receive buffer
- in-place buffer
- temporary scratch buffer
- NCCL internal buffers
- gradient buckets
- FSDP/ZeRO shards
- pipeline activation buffers

Common memory tradeoffs in LLM training:

```text
Data parallel:
  通信简单，但每张 GPU 存完整模型/optimizer state。

FSDP/ZeRO:
  参数、gradient、optimizer state 切分，省显存；
  但需要 all-gather 参数、reduce-scatter gradient。

Tensor parallel:
  降低单卡参数/activation 压力；
  但每层增加 collective。
```

### Main questions

- The rank order must be consistent
- straggler will slow down all ranks
- all-to-all is easily affected by small messages and network hotspots
- The overlap between collective and compute is critical
- User handwritten send/recv version of all-to-all is often not as good as library scheduling
- Don’t outsmart NCCL unless you clearly understand the topology and protocol behavior

## Mode 4: Direct Remote Pointer/Load-Store

### problems it solves

This is a powerful mode of NVSHMEM within the NVLink/NVSwitch domain:

```text
远端 GPU memory 可以被映射成当前 GPU 可访问的 pointer。
```

The code is similar:

```c
remote_ptr = nvshmem_ptr(local_address, remote_pe);
```

Then the kernel can use`remote_ptr`to read remote GPU memory.

What this solves is:

```text
我不想显式 send/recv
也不想显式 put/get 到 temporary buffer
我想在 compute kernel 中直接读远端数据并计算
```

### Existing tools

- NVSHMEM `nvshmem_ptr`
- CUDA peer access / UVA in some lower-level contexts
- NVLink / NVSwitch memory mapping
- PyTorch SymmetricMemory and similar framework-level experiments
- NCCL symmetric memory and device API direction, depending on available versions/features

### Implement logic

Setup phase:

```text
1. NVSHMEM init
2. symmetric allocation
3. runtime 建立 remote GPU memory mapping
4. 当前 GPU 获取远端 pointer
```

Running phase:

```text
CUDA kernel 里直接 load remote_ptr[index]
计算
写本地 B 或远端 memory
```

In the distributed transpose example of session, the meaning of direct pointer is:

```text
不用先 get 远端 block 到 temporary T
而是直接拿远端 A block 的 pointer
transpose kernel 直接从远端读
```

Explicit communication steps and temporary movements are thus reduced.

### Handshake and synchronization

Direct pointers reduce explicit communication APIs but do not eliminate synchronization issues.

Still have to ensure:

```text
远端数据已经写好
本地 kernel 开始读时数据可见
不会在读的时候被远端覆盖
写结果后其他 GPU 读取前已经完成
```

Common methods:

- barrier
- stream/event
- NVSHMEM quiet/fence
-`__syncthreads()`in the kernel only synchronizes this block and does not synchronize across GPUs.
- More advanced signal/atomic protocol

### communication path

Ideal situation:

```text
GPU SM load instruction
-> NVLink/NVSwitch
-> remote GPU HBM
-> data returned to local GPU
```

This looks like a normal load, but the bandwidth and latency follow the remote link, not the local HBM.

If the remote GPU is not in a directly mappable domain,`nvshmem_ptr`may not be available and fallback to put/get or network path is required.

### Memory usage

Advantages:

- Can reduce temporary receive buffer
- Can reduce the steps of copy into local T
- Communication can be integrated into compute kernel

Still required:

- symmetric heap
-mapping metadata
-synchronization variables
- fallback buffer is sometimes needed

### Main questions

- Depends on topology, most effective in NVLink/NVSwitch
- remote load is much slower than local HBM
- Access mode must be coalesced, otherwise performance will be poor
- Cross-network is not suitable for ordinary pointer thinking
- Memory consistency and lifecycle management are more difficult
- The program is closer to a handwritten protocol, making debugging more difficult

https://developer.nvidia.com/blog/fusing-communication-and-compute-with-new-device-api-and-copy-engine-collectives-in-nvidia-nccl-2-28/
![](../../assets/multi_gpu_communication_modes/file-20260609114509549.png)
## Setup：Multi-Node NVLink / IMEX / NCCL

This section is a checklist at the system configuration level. The questions it answers are:

```text
如果多个 host 之间硬件上有 NVLink / NVSwitch，
怎么让 CUDA/NCCL 真正把它识别成可跨节点访问的 GPU memory domain？
```

Core conclusion:

```text
硬件有 NVLink 不等于应用一定能跨 host load/store。

还需要：
  driver 支持
  CUDA 支持
  NCCL 支持
  IMEX domain 正确
  IMEX channel 权限正确
  NCCL 环境变量/运行时配置正确
```

### 1. First distinguish between ordinary multi-machine and Multi-Node NVLink

Ordinary multi-machine GPU communication:

```text
Host A GPU memory
  -> NIC / InfiniBand / RDMA / Ethernet
  -> Host B GPU memory
```

In this case, cross-host GPUs usually do not belong to the same LSA team. Cross-node communication mainly relies on:

```text
NCCL collective
GPUDirect RDMA
GIN
network plugin
```

Multi-Node NVLink / NVLink Switch System：

```text
Host A GPU memory
  -> NVLink / NVSwitch fabric
  -> Host B GPU memory
```

If the software stack is configured correctly, cross-host GPUs may enter the same load/store accessible domain:

```text
CUDA P2P load/store works across hosts
NCCL LSA team may include GPUs across hosts
ncclGetLsaPointer / ncclGetPeerPointer may work across hosts
LSA barrier may synchronize across those GPUs
```

### 2. Driver / CUDA / NCCL must match

Requires compatible NVIDIA data center driver, CUDA runtime and NCCL versions. The reason is:

```text
driver:
  负责 GPU、NVLink/NVSwitch、P2P memory mapping、fabric handle 等底层能力。

CUDA:
  负责进程里的 GPU memory allocation、cuMem/cuda P2P、fabric memory handle。

NCCL:
  负责识别 topology，并决定是否使用 MNNVL、NVLink、RDMA 或 fallback path。
```

For NCCL MNNVL, common environment variables:

```bash
export NCCL_MNNVL_ENABLE=1
export NCCL_CUMEM_ENABLE=1
export NCCL_DEBUG=INFO
```

meaning:

```text
NCCL_MNNVL_ENABLE:
  允许 NCCL 在可用时使用 Multi-Node NVLink。

NCCL_CUMEM_ENABLE:
  启用 NCCL 需要的 CUDA memory management path。

NCCL_DEBUG=INFO:
  让 NCCL 打印 topology/path/protocol 信息，方便确认有没有走 MNNVL。
```

Note: If the system, driver or fabric does not support it,`NCCL_MNNVL_ENABLE=1`will not force success and NCCL may be automatically disabled or fallbacked.

### 3. What is IMEX

IMEX = Internode Memory Exchange/Import-Export service. What it does is:

```text
在 Multi-Node NVLink 系统里，
管理跨 OS domain / 跨节点的 GPU memory export/import。
```

What it does is not data transfer itself, but memory mapping management:

```text
exporting node:
  GPU memory allocation
  create sharable handle
  VA -> PA -> fabric address mapping

importing node:
  通过 IMEX 获取 mapping 信息
  建立本进程可访问的远端 GPU memory mapping
```

intuition:

```text
NVLink fabric 负责数据读写路径；
IMEX 负责告诉各节点“哪些远端 GPU memory 可以被谁安全 map”。
```

### 4. Start IMEX service

Check on all compute nodes belonging to the same NVLink domain:

```bash
systemctl status nvidia-imex
```

If not enabled:

```bash
sudo systemctl enable nvidia-imex
sudo systemctl start nvidia-imex
```

If the configuration has been modified:

```bash
sudo systemctl restart nvidia-imex
```

In a container environment, it is usually necessary to confirm that the container can see the corresponding device and permissions.

### 5. List of nodes configured with IMEX domain

IMEX needs to know which nodes belong to the same memory sharing domain. Common configuration files:

```text
/etc/nvidia-imex/nodes_config.cfg
```

Put the IPs of all compute nodes in the same NVLink domain, for example:

```text
10.0.0.11
10.0.0.12
10.0.0.13
10.0.0.14
```

Some environments specify configuration files through variables:

```bash
export IMEX_NODE_CONFIG_FILE=/path/to/nodes_config.cfg
```

Require:

```text
每个 node 上的 node list 要一致
这些 IP 要能通过 TCP/gRPC 互相通信
IMEX service 要在 job 启动前完成初始化
```

### 6. IMEX channel permissions

Multi-node workload usually requires IMEX channel, usually channel 0:

```bash
ls -l /dev/nvidia-caps-imex-channels/
```

Expect to see something like:

```text
channel0
```

If there is no channel, or the job user/container does not have permissions, CUDA fabric memory / cross-node memory mapping may fail, with the following symptoms:

```text
permission denied
insufficient permission
CUDA_ERROR_NOT_SUPPORTED
fabric handle import/export failure
```

### 7. Verify NVLink / P2P / MNNVL

Basic check:

```bash
nvidia-smi
nvidia-smi nvlink --status
nvidia-smi topo -p2p p
```

Focus:

```text
GPU 是否都可见
NVLink link 是否 active
跨 host GPU 是否显示 P2P capable
```

NCCL verification:

```bash
NCCL_DEBUG=INFO \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
mpirun -np 4 -H hostA:2,hostB:2 \
  ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1
```

Look at the NCCL log:

```text
是否识别出 MNNVL / NVLink path
是否 fallback 到 IB/RDMA
是否有 IMEX / CUMEM / fabric handle 报错
```

### 8. Common failure modes

The hardware is not part of the MNNVL platform:

```text
症状:
  跨 host 只能走 NIC/RDMA。

处理:
  不要期待 LSA 跨 host；用 NCCL collective / GPUDirect RDMA / GIN。
```

IMEX is not started or the node list is inconsistent:

```text
症状:
  cross-node fabric memory mapping 失败。

处理:
  检查 nvidia-imex service、nodes_config.cfg、节点互通性。
```

Insufficient IMEX channel permissions:

```text
症状:
  permission denied / insufficient permission。

处理:
  检查 /dev/nvidia-caps-imex-channels/channel0 权限和容器挂载。
```

NCCL/CUDA/driver version mismatch:

```text
症状:
  MNNVL 自动 disabled，或者 fallback 到 IB/RDMA。

处理:
  使用平台支持矩阵中匹配的 driver / CUDA / NCCL。
```

Environment variables are not enabled:

```text
症状:
  NCCL 没使用 MNNVL。

处理:
  设置 NCCL_MNNVL_ENABLE=1、NCCL_CUMEM_ENABLE=1，并打开 NCCL_DEBUG=INFO 验证。
```

### 9. Impact on Device API/LSA

If MNNVL/IMEX configuration is successful:

```text
devComm.lsaSize 可能覆盖跨 host GPUs
ncclTeamTagLsa() 可能跨 host 生效
ncclGetLsaPointer(...) 可能能拿到跨 host peer pointer
LSA barrier 可能同步跨 host LSA team
```

If configuration is unsuccessful:

```text
LSA team 通常只覆盖本 host 内 P2P 可达 GPUs
跨 host 要走 GIN / GPUDirect RDMA / 普通 NCCL collective
```

So the judgment criterion is not the sentence "whether there is NVLink on the hardware", but:

```text
CUDA/NCCL 是否真的把这些 GPUs 放进同一个 load/store accessible domain。
```

refer to:

- NVIDIA IMEX Service overview: <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>
- NVIDIA MNNVL overview: <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>
- NVIDIA MNNVL verification: <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html>
- NCCL environment variables: <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>

## Practical Guide: NCCL code writing order

This section first sorts out the host-side writing method when NCCL Device API is not used normally, and then gives a DP all-reduce teaching skeleton using NCCL Device API.

The core model of Host-side NCCL is:

```text
CPU host 调 NCCL API
NCCL work enqueue 到 CUDA stream
GPU/runtime 按 stream/event 依赖执行通信
```

In other words, traditional NCCL does not directly call`ncclAllReduce`in the CUDA kernel. The normal way to write it is host-side enqueue:

```c
ncclAllReduce(..., comm, stream);
```

### 1. Host-side NCCL: distinguish the objects first

```text
cudaSetDevice:
  当前 CPU thread 正在操作哪张 GPU 的 CUDA context。

cudaStream_t stream:
  某一张 GPU 上的任务队列。
  决定 kernel / memcpy / NCCL work 在本 GPU 上的执行顺序。

cudaEvent_t event:
  stream 之间表达依赖和完成状态。
  例如 grad_ready、comm_done。

GPU buffer / pointer:
  真正存数据的 GPU memory。
  例如 grad、sendbuf、recvbuf。

ncclComm_t comm:
  NCCL 通信组上下文。
  决定这个 rank 和哪些 ranks 通信，以及 NCCL 如何理解 topology/path/protocol。
```

`communicator`can be understood as the "communication group chat context" of NCCL. It is not the data itself, nor the CUDA execution queue, but a reusable communication group description:

```text
这个 group 里有哪些 ranks / GPUs
当前 rank 在 group 内编号是多少
group size 是多少
rank 到 GPU / host 的映射是什么
GPU 之间 topology 是什么
同机走 NVLink / NVSwitch / PCIe 还是 shared memory
跨机走 IB / RDMA / network plugin 还是 MNNVL
NCCL 可以用哪些 rings / trees / channels / protocols
```

so:

```c
ncclAllReduce(..., dp_comm, stream);
```

means:

```text
在 dp_comm 这个通信组内做 all-reduce；
把这次 all-reduce work enqueue 到 stream 上执行。
```

In one sentence:

```text
cudaSetDevice 决定当前操作环境。
stream 决定什么时候执行。
event 决定 stream 之间谁等谁。
buffer 是真正的数据。
communicator 决定和谁通信。
```

### 2. Code logical sequence

Writing NCCL programs can be thought of in this order:

```text
1. 确认硬件/topology
   NVLink / NVSwitch / PCIe / IB / RDMA / MNNVL

2. 建立 rank -> GPU 映射
   rank 0 -> GPU0
   rank 1 -> GPU1
   ...

3. 对当前 GPU 调 cudaSetDevice

4. 在当前 GPU 上创建 stream、分配 buffer

5. 初始化 NCCL communicator
   global communicator 或 DP/TP/PP/EP group communicator

6. 在 compute_stream 上 launch compute kernel

7. 用 event 表示 data ready
   comm_stream 等 compute_stream 的 ready event

8. host 调 NCCL API，把通信 enqueue 到 comm_stream

9. 用 event 表示 comm done
   compute_stream 等 comm_stream 的 done event

10. 后续 kernel 使用通信结果
```

Some details:
```
cudaEventSynchronize(event):
  CPU 等 event 完成。

cudaStreamSynchronize(stream):
  CPU 等 stream 完成。

cudaStreamWaitEvent(stream, event):
  GPU stream 等 event，CPU 不等。
```
### 3. Single process single GPU/rank: DP all-reduce example

This is the most common distributed training perspective:

```text
每个 process 控制一张 GPU
每个 process 是一个 global rank
每个 rank 只有自己的 ncclComm_t handle
```

Assumptions:

```text
global ranks: 0 1 2 3 4 5 6 7
tp_size = 2
dp_size = 4

DP group for tp_rank 0: [0, 2, 4, 6]
DP group for tp_rank 1: [1, 3, 5, 7]
```

rank computes its own local rank:

```c
int global_rank = ...;
int world_size  = ...;
int local_gpu   = ...;

int tp_size = 2;
int dp_size = world_size / tp_size;

int tp_rank = global_rank % tp_size;
int dp_rank = global_rank / tp_size;
```

Initialize the CUDA resources of this GPU:

```c
cudaSetDevice(local_gpu);

cudaStream_t compute_stream;
cudaStream_t comm_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&comm_stream);

float* grad;
cudaMalloc(&grad, bytes);
```

Create a communicator of the DP group to which the current rank belongs:

```c
ncclUniqueId dp_id;

if (dp_rank == 0) {
    ncclGetUniqueId(&dp_id);
}

// 需要用 MPI / TCP / torch launcher 等方式，
// 把 dp_id broadcast 给同一个 DP group 内的 ranks。
broadcast_to_dp_group(&dp_id, tp_rank);

ncclComm_t dp_comm;
ncclCommInitRank(
    &dp_comm,
    dp_size,   // DP group size，不是 global world_size
    dp_id,
    dp_rank    // 当前 rank 在 DP group 内的 rank
);
```

When`ncclCommInitRank`creates`dp_comm`, the background mainly does these things. The following is a conceptual model; the specific implementation of NCCL will vary with version, hardware, driver, network plugin and environment variables.

```text
1. 应用层先决定 group membership:
   例如 global ranks [0,2,4,6] 组成一个 DP group。
   这一步不是 NCCL 自动替你决定的，而是你的训练框架 / launcher 决定的。

2. 生成并分发 dp_id:
   DP group 内某一个 rank 调 ncclGetUniqueId(&dp_id)。
   然后应用用 MPI / TCP store / torch distributed store / socket 等 CPU-side 机制
   把同一个 dp_id 发给这个 DP group 的所有 ranks。
   dp_id 的作用是让这些 ranks 知道自己属于同一个 NCCL communicator clique。

3. 每个 rank 设置自己的 CUDA device:
   cudaSetDevice(local_gpu) 必须在 ncclCommInitRank 前完成。
   NCCL 会把这个 communicator handle 绑定到当前 CUDA device。
   同一个 NCCL communicator 里，不应该把同一张 CUDA device 当成多个不同 rank 使用，
   否则可能 hang。

4. 参数一致性和局部 rank 编号:
   每个成员都调用:
     ncclCommInitRank(&dp_comm, dp_size, dp_id, dp_rank)
   dp_size 必须一致。
   dp_rank 必须在 [0, dp_size) 之间，并且在这个 communicator 内唯一。
   例如 global ranks [0,2,4,6] 会映射成 dp ranks [0,1,2,3]。

5. Bootstrap / rendezvous:
   所有 DP group 成员通过 dp_id 进入同一个 init 协议。
   NCCL 建立 CPU-side bootstrap control path，用来交换初始化 metadata。
   这个阶段类似“大家先碰头、互相报地址、确认谁是谁”。
   如果某个 rank 没调用、dp_id 不一致、dp_size 不一致、rank 重复，
   常见结果就是 init 卡住或返回错误。

6. 交换基础身份信息:
   NCCL 会在 ranks 之间交换 host / process / GPU / network 相关信息，例如：
     hostname 或 host hash
     rank id
     CUDA device
     GPU PCI bus id
     GPU architecture / capability
     local rank / node 内 rank 信息
     可见 network interfaces / NIC / HCA 信息
     NCCL version 和相关 capability
   这些信息不是 gradient 数据，而是后续选择通信路径需要的 metadata。

7. Topology discovery:
   NCCL 会理解 GPU 和 NIC 的物理连接关系，例如：
     GPU-GPU: NVLink / NVSwitch / PCIe / P2P 可达性
     GPU-NIC: 哪张 GPU 离哪个 NIC 更近
     NUMA / PCIe root complex / switch 层级
     container 或 VM 里暴露出来的 /sys PCI topology
   如果 /sys 不完整，或者容器/VM 暴露的是虚拟 topology，
   NCCL 可能选出次优路径。

8. 选择 transport:
   同机可能选择：
     P2P / NVLink / NVSwitch / CUDA IPC
     PCIe P2P
     SHM fallback
   跨机可能选择：
     NET/IB / GPUDirect RDMA
     NCCL net plugin
     NET/Socket fallback
   MNNVL 平台可能还会考虑：
     Multi-Node NVLink
     IMEX domain
     cuMem / symmetric allocation 相关路径

9. 建立或准备 peer connections:
   NCCL 会为 rank pairs / channels 准备连接状态。
   有些连接资源可能在 init 阶段建立。
   有些可能是 lazy 的，在第一次 collective / send-recv 时才真正完成。
   所以 init 成功不代表所有数据路径已经传过真实 payload。

10. 选择 algorithm / protocol / channels:
    NCCL 会根据 topology、message pattern、环境变量和内部规则准备通信结构，例如：
      rings
      trees
      channels
      CTAs / channel count
      LL / LL128 / Simple protocol
      NVLS / CollNet 等可选路径
    后续 ncclAllReduce 不是每次重新发明通信拓扑，
    而是在这个 communicator 的基础上选择/复用合适的结构。

11. 分配 host/device/internal resources:
    可能包括：
      communicator 内部状态
      channel metadata
      proxy thread / proxy state
      shared memory segment
      network connection state
      CUDA event / stream 相关内部状态
      transport buffers / registration metadata
    所以 communicator 不是一个轻量纯编号；
    创建太多 communicator 会增加 memory 和 setup 开销。

12. 返回 opaque handle:
    dp_comm 是一个 NCCL opaque communicator handle。
    应用不直接读里面的数据结构。
    后续 ncclAllReduce / ncclSend / ncclRecv 通过它知道：
      这个 rank 是谁
      要和哪些 peers 通信
      大致应该走哪些 transport / channels / algorithms
```

Note: Creating`dp_comm`will not pass gradient and will not execute all-reduce. Real communication occurs when:

```c
ncclAllReduce(..., dp_comm, stream);
```

That is to say:

```text
ncclCommInitRank:
  建 group、认 peers、发现 topology、准备 transport/algorithm/resource。

ncclAllReduce:
  真正把 gradient 数据 enqueue 到 CUDA stream 上执行通信。
```

If you want to confirm what is happening in the above init phases, you can open:

```bash
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=INIT,BOOTSTRAP,GRAPH,NET,P2P,SHM,TUNING,ENV,ALLOC \
NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log \
./your_program
```

These subsystems roughly correspond to:

```text
BOOTSTRAP:
  rank rendezvous 和初始化控制路径。

INIT:
  communicator 初始化过程。

GRAPH:
  topology detection 和 graph search。

NET / P2P / SHM:
  选了网络、GPU peer-to-peer，还是 shared-memory path。

TUNING:
  algorithm/protocol/channel tuning。

ALLOC:
  初始化中发生的资源分配。
```

Training step:

```c
cudaEvent_t grad_ready;
cudaEvent_t comm_done;
cudaEventCreate(&grad_ready);
cudaEventCreate(&comm_done);

backward_kernel<<<grid, block, 0, compute_stream>>>(grad);

cudaEventRecord(grad_ready, compute_stream);
cudaStreamWaitEvent(comm_stream, grad_ready, 0);

ncclAllReduce(
    grad,
    grad,
    count,
    ncclFloat,
    ncclSum,
    dp_comm,       // 只在 DP group 内同步 gradient
    comm_stream
);

cudaEventRecord(comm_done, comm_stream);
cudaStreamWaitEvent(compute_stream, comm_done, 0);

optimizer_kernel<<<grid, block, 0, compute_stream>>>(grad);
```

Key points:

```text
global_rank:
  全局 rank 编号。

dp_rank:
  当前 rank 在 DP group 内的编号。

dp_comm:
  只代表当前 DP group。
  data-parallel gradient all-reduce 应该用 dp_comm，而不是 global communicator。
```

### 4. Device-side NCCL: objects are distinguished first

When using the NCCL Device API, in addition to the host-side`cudaSetDevice`,`stream`,`event`,`buffer`, and`comm`, these objects must also be distinguished:

```text
ncclWindow_t win:
  注册过的 symmetric memory window。
  它不是新内存，而是 communicator 内对某段 buffer 的登记表。
  kernel 后面通过 window 找到 peer rank 的 buffer。

ncclDevComm devComm:
  device-side communicator。
  由 host-side ncclComm_t 创建，但传进 CUDA kernel 使用。
  kernel 通过它知道 rank、LSA team、device-side resources。

ncclDevCommRequirements reqs:
  创建 devComm 前声明需要哪些 device-side resources。
  例如 lsaBarrierCount、GIN signals/barriers、multimem capability。

ncclTeam / ncclTeamTag:
  communicator 里的 rank 子集。
  World team 是整个 communicator。
  LSA team 是 load/store accessible peers。
  Rail team 是 network rail 上对应的 peers。

ncclCoopCta / ncclCoopWarp / ncclCoopThread:
  device-side operation 的本地 thread cooperative scope。
  ncclCoopCta 表示当前 CUDA block / CTA 作为协作单位。

ncclLsaBarrierSession:
  CUDA kernel 内跨 LSA peers 的 barrier session。
  它不是 CUDA event，也不是 stream synchronization。

barrier slot / index:
  例如 blockIdx.x。
  决定当前 CTA 使用哪个 LSA barrier resource。
```

The most confusing relationship:

```text
ncclComm_t:
  CPU 用，host-side NCCL API 用。

ncclDevComm:
  GPU kernel 用，device-side NCCL API 用。

ncclWindow_t:
  不是通信组，而是 registered memory view。

cudaStream_t:
  决定什么时候执行。

cudaEvent_t:
  决定 stream 之间谁等谁。

ncclTeamTagLsa:
  决定 communicator 里的哪个 peer subset 参与 device-side 操作。
```

Corresponds to the following LSA kernel:

```text
devComm:
  告诉 kernel 当前 rank / LSA team / barrier resources。

win:
  告诉 kernel 每个 peer 的 grad buffer 在哪里。

ncclCoopCta:
  告诉 NCCL 这次操作由当前 CUDA block 协作完成。

ncclTeamTagLsa:
  告诉 NCCL 只在 LSA-accessible peers 内同步。

blockIdx.x:
  选择当前 block 使用的 barrier slot。
```

### 5. Single process single GPU/rank: Device-side NCCL DP all-reduce example

If the NCCL Device API is used, the structure of the same DP all-reduce will become:

```text
host:
  创建 dp_comm
  分配 symmetric memory
  注册 NCCL window
  创建 ncclDevComm
  launch 自定义 CUDA kernel

device kernel:
  用 NCCL device primitives 做跨 GPU 同步和 remote load/store
```

This is not called directly in the kernel:

```c
// 不是这样
__global__ void k() {
    ncclAllReduce(...);
}
```

Instead, use the building blocks provided by the NCCL Device API to write the communication kernel yourself.

Below is the pedagogical skeleton of the LSA path. It assumes that the GPUs in the current DP group are all load/store accessible, such as the same NVLink/NVSwitch/CUDA P2P domain.

If the DP group spans ordinary IB/RDMA nodes, this LSA kernel cannot be directly applied; either continue to use the host-side NCCL collective, or study the GIN / network-capable device-side path.

Host setup：

```c
int global_rank = ...;
int world_size  = ...;
int local_gpu   = ...;

int tp_size = 2;
int dp_size = world_size / tp_size;

int tp_rank = global_rank % tp_size;
int dp_rank = global_rank / tp_size;

cudaSetDevice(local_gpu);

cudaStream_t compute_stream;
cudaStream_t comm_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&comm_stream);

size_t bytes = count * sizeof(float);

// 1. 创建当前 DP group 的 host-side communicator
ncclUniqueId dp_id;

if (dp_rank == 0) {
    ncclGetUniqueId(&dp_id);
}

broadcast_to_dp_group(&dp_id, tp_rank);

ncclComm_t dp_comm;
ncclCommInitRank(
    &dp_comm,
    dp_size,
    dp_id,
    dp_rank
);

// 2. 分配 symmetric memory。
// Device API 的 window buffer 要用 NCCL 支持的 symmetric allocation。
float* grad;
ncclMemAlloc((void**)&grad, bytes);

// 3. 注册 NCCL window。
ncclWindow_t grad_win;
ncclCommWindowRegister(
    dp_comm,
    grad,
    bytes,
    &grad_win,
    NCCL_WIN_COLL_SYMMETRIC
);

// ncclComm_t: 通信群聊
// grad: 你本地手里的文件
// ncclWindow_t: 群聊里登记过的一组共享文件位置表
// ncclGetLsaPointer: 根据这个表，找到某个成员那份文件的地址

// 4. 创建 device communicator。
ncclDevComm dev_dp_comm;
ncclDevCommRequirements reqs = NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER;

int nCTAs = 16;
reqs.lsaBarrierCount = nCTAs;

ncclDevCommCreate(dp_comm, &reqs, &dev_dp_comm);

// dp_comm:
//   host-side communicator
//   CPU 代码用它调用 NCCL API

// dev_dp_comm:
//   device-side communicator
//   GPU kernel 用它做 device-initiated communication

```


Training step：

```c
cudaEvent_t grad_ready;
cudaEvent_t comm_done;
cudaEventCreate(&grad_ready);
cudaEventCreate(&comm_done);

// 1. compute stream 上产生 gradient。
backward_kernel<<<grid, block, 0, compute_stream>>>(grad);

// 2. comm stream 等 gradient ready。
cudaEventRecord(grad_ready, compute_stream);
cudaStreamWaitEvent(comm_stream, grad_ready, 0);

// 3. launch device-side NCCL all-reduce kernel。
// nCTAs 个 CUDA blocks
// 每个 block 512 threads
// 0 是 dynamic shared memory size
dp_lsa_allreduce_kernel<float><<<nCTAs, 512, 0, comm_stream>>>(
    dev_dp_comm,
    grad_win,
    0,       // byte offset inside window
    count
);

// 4. compute stream 等 device-side communication 完成。
cudaEventRecord(comm_done, comm_stream);
cudaStreamWaitEvent(compute_stream, comm_done, 0);

optimizer_kernel<<<grid, block, 0, compute_stream>>>(grad);
```

Device kernel：

```c
template <typename T>
__global__ void dp_lsa_allreduce_kernel(
    ncclDevComm devComm,
    ncclWindow_t win,
    size_t offset,
    size_t count
) {
    ncclLsaBarrierSession<ncclCoopCta> bar {
        ncclCoopCta(),
        devComm,
        ncclTeamTagLsa(),
        blockIdx.x
    };

    // 等 DP group 内 LSA peers 的对应 CTA 到达。
    // acquire 约束后续 remote reads，避免过早读 peer memory。
    bar.sync(ncclCoopCta(), cuda::memory_order_acquire);

    int lsa_rank = devComm.lsaRank;
    int lsa_size = devComm.lsaSize;

    int global_tid =
        threadIdx.x + blockDim.x * (lsa_rank + blockIdx.x * lsa_size);

    int global_nthreads =
        blockDim.x * gridDim.x * lsa_size;

    for (size_t i = global_tid; i < count; i += global_nthreads) {
        T sum = 0;

        // 读每个 LSA peer 的同一个 gradient element。
        for (int peer = 0; peer < lsa_size; peer++) {
            T* peer_grad = (T*)ncclGetLsaPointer(win, offset, peer);
            sum += peer_grad[i]; 
            # load happen
            # GPU 硬件的 memory coalescer / memory subsystem
            # CUDA threads
			#   -> warp instruction（32）
			#   -> per-thread addresses
			#   -> SM load/store unit
			#   -> memory coalescer
			#   -> L1/L2/cache fabric
			#   -> local HBM or NVLink/NVSwitch remote memory transaction 
        }

        // 写回所有 peer，形成 in-place all-reduce。
        for (int peer = 0; peer < lsa_size; peer++) {
            T* peer_grad = (T*)ncclGetLsaPointer(win, offset, peer);
            peer_grad[i] = sum;
        }
    }

    // release 约束前面的 writes，让后续用户看到完成后的结果。
    bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
```

The semantics of this code are:

```text
dp_comm:
  host-side DP group communicator，只用于 setup。

dev_dp_comm:
  device-side communicator，传进 CUDA kernel。

grad_win:
  symmetric memory window，kernel 通过它找 peer memory。

ncclGetLsaPointer:
  拿到某个 LSA peer 的 window pointer。

ncclLsaBarrierSession:
  在 kernel 内做跨 rank / 跨 CTA 的 LSA barrier。
```

Note that this kernel is a mechanism teaching version, not a high-performance all-reduce implementation. Real implementations typically use:

```text
chunking
vectorized/coalesced load-store
warp/block cooperative access
multimem / NVLink SHARP
NCCL remote reduce/copy building blocks
GIN for network device-side communication
```

### 6. Single process multiple GPUs: one CPU process controls multiple GPUs

This writing method is often used for stand-alone testing or some internal runtime scenarios:

```text
一个 CPU process 持有多个 local GPU 的 stream/buffer/communicator handles。
每张 GPU 仍然对应一个 NCCL rank。
```

For example, a process controls 8 GPUs:

```c
int world_size = 8;
int devs[8] = {0,1,2,3,4,5,6,7};

cudaStream_t compute_streams[8];
cudaStream_t comm_streams[8];
float* grads[8];

for (int g = 0; g < world_size; g++) {
    cudaSetDevice(devs[g]);

    cudaStreamCreate(&compute_streams[g]);
    cudaStreamCreate(&comm_streams[g]);

    cudaMalloc(&grads[g], bytes);
}
```

If it is a global communicator, you can directly:

```c
ncclComm_t comms[8];
ncclCommInitAll(comms, world_size, devs);
```

The mapping relationship is:

```text
comms[0] -> rank 0 / GPU0
comms[1] -> rank 1 / GPU1
...
comms[7] -> rank 7 / GPU7
```

Note:`comms[1]`does correspond to rank 1, but it is not an "object in another process". It is the local handle that controls GPU1/rank1 in the current process.

Do global all-reduce:

```c
ncclGroupStart();

for (int g = 0; g < world_size; g++) {
    cudaSetDevice(devs[g]);

    ncclAllReduce(
        grads[g],
        grads[g],
        count,
        ncclFloat,
        ncclSum,
        comms[g],
        comm_streams[g]
    );
}

ncclGroupEnd();
```

Why is there still`cudaSetDevice(g)`in the loop:

```text
stream 已经属于某张 GPU，但 CUDA/NCCL API 仍依赖当前 host thread 的 current device。
操作 GPU g 的 pointer / stream / comm 前，应该先 cudaSetDevice(g)。
```

### 7. Single process multiple GPUs: Create DP group communicator

If there is TP/DP mixed parallelism, global communicator cannot be used for all gradients.

Assumptions:

```text
global GPUs: 0 1 2 3 4 5 6 7
tp_size = 2
dp_size = 4

DP group 0: [0, 2, 4, 6]
DP group 1: [1, 3, 5, 7]
```

Create two DP communicator groups:

```c
int dp0_devs[4] = {0,2,4,6};
int dp1_devs[4] = {1,3,5,7};

ncclComm_t dp0_comms[4];
ncclComm_t dp1_comms[4];

ncclCommInitAll(dp0_comms, 4, dp0_devs);
ncclCommInitAll(dp1_comms, 4, dp1_devs);
```

Mapping relationship:

```text
dp0_comms[0] -> GPU0 在 DP group 0 里的 rank 0
dp0_comms[1] -> GPU2 在 DP group 0 里的 rank 1
dp0_comms[2] -> GPU4 在 DP group 0 里的 rank 2
dp0_comms[3] -> GPU6 在 DP group 0 里的 rank 3

dp1_comms[0] -> GPU1 在 DP group 1 里的 rank 0
dp1_comms[1] -> GPU3 在 DP group 1 里的 rank 1
dp1_comms[2] -> GPU5 在 DP group 1 里的 rank 2
dp1_comms[3] -> GPU7 在 DP group 1 里的 rank 3
```

Do DP all-reduce:

```c
ncclGroupStart();

for (int i = 0; i < 4; i++) {
    int gpu = dp0_devs[i];
    cudaSetDevice(gpu);

    ncclAllReduce(
        grads[gpu],
        grads[gpu],
        count,
        ncclFloat,
        ncclSum,
        dp0_comms[i],
        comm_streams[gpu]
    );
}

for (int i = 0; i < 4; i++) {
    int gpu = dp1_devs[i];
    cudaSetDevice(gpu);

    ncclAllReduce(
        grads[gpu],
        grads[gpu],
        count,
        ncclFloat,
        ncclSum,
        dp1_comms[i],
        comm_streams[gpu]
    );
}

ncclGroupEnd();
```

Key rules:

```text
data parallel gradient sync:
  用 dp_comm。

tensor parallel layer-internal communication:
  用 tp_comm。

pipeline stage activation send/recv:
  用 pp_comm 或 P2P communicator。

expert parallel / MoE all-to-all:
  用 ep_comm。
```

### 8. Practical checklist

Check each item one by one when writing code:

```text
1. 当前 CPU thread 有没有 cudaSetDevice 到正确 GPU？
2. pointer 是不是属于当前 GPU？
3. stream 是不是在当前 GPU 上创建的？
4. communicator handle 是不是对应当前 GPU/rank？
5. collective 是否用对 group communicator？
6. 所有 rank 是否按相同顺序调用 collective？
7. 数据 ready 前通信是否被 event 阻止？
8. 使用通信结果前是否等待了 comm_done？
9. 多 GPU 单进程时是否用 ncclGroupStart / ncclGroupEnd 包住多卡 NCCL launch？
10. DP/TP/PP/EP 是否分别创建了自己的 communicator？
```

The shortest mental model:

```text
cudaSetDevice:
  我现在操作哪张 GPU。

stream:
  这张 GPU 上什么时候执行。

communicator:
  这次操作和哪些 rank 通信。
```

### 9. Debug: How to measure the link speed after communicator init

After the success of`ncclCommInitRank`, the first thing is not to directly doubt the device-side kernel, but to first confirm whether the`dp_comm`itself can run out of a reasonable bandwidth.

The most practical order is:

```text
1. 用同一个 dp_comm 跑 host-side ncclAllReduce benchmark。
2. 打开 NCCL debug log，看 NCCL 选了什么 transport。
3. 如果 host-side NCCL 慢，再查 topology / rank mapping / RDMA。
4. 如果 host-side NCCL 快，但 device-side kernel 慢，再查自己写的 kernel。
```

#### 9.1 Use the same`dp_comm`to measure the upper limit of host-side NCCL

The questions answered in this step are:

```text
这个 DP communicator 对应的 ranks / GPUs / network path 是否正常？
NCCL 给这个 group 选出来的通信路径大概能跑到多少？
```

Test code:

```c
// dp_comm 已经 ncclCommInitRank 完成。
// grad 可以是 cudaMalloc 分配的普通 GPU buffer；
// 如果后面要接 Device API，也可以直接用 ncclMemAlloc 的 buffer。

cudaStream_t test_stream;
cudaStreamCreate(&test_stream);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int warmup = 5;
int iters = 20;

for (int i = 0; i < warmup; i++) {
    ncclAllReduce(
        grad,
        grad,
        count,
        ncclFloat,
        ncclSum,
        dp_comm,
        test_stream
    );
}
cudaStreamSynchronize(test_stream);

cudaEventRecord(start, test_stream);

for (int i = 0; i < iters; i++) {
    ncclAllReduce(
        grad,
        grad,
        count,
        ncclFloat,
        ncclSum,
        dp_comm,
        test_stream
    );
}

cudaEventRecord(stop, test_stream);
cudaEventSynchronize(stop);

float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, start, stop);

double avg_ms = total_ms / iters;
double bytes = count * sizeof(float);
double sec = avg_ms / 1000.0;

double algbw = bytes / sec / 1e9;
double busbw = algbw * 2.0 * (dp_size - 1) / dp_size;

if (dp_rank == 0) {
    printf("DP comm allreduce: avg %.3f ms, algbw %.2f GB/s, busbw %.2f GB/s\n",
           avg_ms, algbw, busbw);
}
```

here:

```text
algbw:
  从算法视角看，一个 rank 的 buffer 被 reduce 的速度。

busbw:
  估算底层链路实际承载的流量。
  all-reduce 常用 busbw = algbw * 2 * (N - 1) / N。
```

Be careful to use a large enough message size, for example:

```text
256 MB
512 MB
1 GB
```

Small messages are mainly used to measure latency and are not suitable for determining whether the link bandwidth is full.

#### 9.2 Open NCCL log and see what path it chooses

When running benchmark add:

```bash
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET,COLL,P2P,TUNING \
NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log \
./your_program
```

Look at the keywords in the log:

```text
P2P / NVL / NVLS:
  单机 NVLink / NVSwitch / P2P path，通常是好信号。

NET/IB:
  跨机走 InfiniBand / RDMA，通常是好信号。

NET/Socket:
  走 TCP/socket fallback，跨机时通常会慢很多。

SHM:
  走 host shared memory path，不一定是你想要的 GPU direct path。
```

If you expect cross-machine RDMA but see`NET/Socket`in the log, check the network interface / IB / GDRDMA configuration first instead of changing the kernel.

#### 9.3 If the host-side NCCL benchmark is slow

Check rank mapping first. Print for each rank:

```c
int dev;
cudaGetDevice(&dev);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);

printf("global_rank=%d dp_rank=%d local_gpu=%d cuda_dev=%d pci=%04x:%02x:%02x.%x\n",
       global_rank,
       dp_rank,
       local_gpu,
       dev,
       prop.pciDomainID,
       prop.pciBusID,
       prop.pciDeviceID,
       0);
```

To confirm:

```text
每个 rank 绑定到唯一 GPU。
dp_comm 里的 ranks 正好是当前 DP group。
local_gpu / CUDA_VISIBLE_DEVICES / cudaSetDevice 没有错位。
DP group 的 rank 顺序符合你预期的 topology。
```

Check the single-machine GPU-GPU topology again:

```bash
nvidia-smi topo -m
nvidia-smi topo -p2p p
nvidia-smi topo -p2p r
nvidia-smi topo -p2p w
nvidia-smi nvlink --status
```

Confirm the physical link with bare P2P tools:

```bash
./p2pBandwidthLatencyTest
nvbandwidth
```

judge:

```text
p2pBandwidthLatencyTest 也慢:
  GPU-GPU physical path / P2P / NVLink / PCIe / ACS / container topology 有问题。

p2pBandwidthLatencyTest 快，但 ncclAllReduce 慢:
  更可能是 NCCL path selection / rank mapping / communicator group / env config 问题。
```

Check RDMA again when crossing multiple machines:

```bash
ibstat
ibv_devinfo
lsmod | grep nvidia_peermem
ib_write_bw <peer_host>
```

FAQ:

```text
nvidia-peermem 没加载。
NCCL 自动选错 NIC。
NCCL 走了 TCP/socket fallback。
GPU 到 NIC 的 PCIe 距离太远。
RoCE GID / PFC / routing 配置不对。
container 里 /sys PCI topology 不完整。
/dev/shm 或 memlock 太小。
```

You can temporarily specify an interface to locate:

```bash
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5_0
```

If the speed recovers after specifying`NCCL_IB_HCA`, it means that NCCL automatically selected the wrong NIC before or the NIC affinity was not good.

#### 9.4 If host-side NCCL is fast but device-side kernel is slow

This shows that there is a high probability that there is no problem with the communicator and the basic link. The problem is more likely to be in the Device API kernel you wrote.

For the previous LSA example, first check the LSA coverage:

```c
__global__ void dump_lsa_info(ncclDevComm devComm, int* out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = devComm.lsaRank;
        out[1] = devComm.lsaSize;
    }
}
```

host side:

```c
int* d_info;
int h_info[2];

cudaMalloc(&d_info, 2 * sizeof(int));

dump_lsa_info<<<1, 1, 0, comm_stream>>>(dev_dp_comm, d_info);
cudaMemcpyAsync(
    h_info,
    d_info,
    2 * sizeof(int),
    cudaMemcpyDeviceToHost,
    comm_stream
);
cudaStreamSynchronize(comm_stream);

printf("dp_rank=%d lsaRank=%d lsaSize=%d dp_size=%d\n",
       dp_rank, h_info[0], h_info[1], dp_size);
```

Key judgments:

```text
lsaSize == dp_size:
  LSA peers 覆盖整个 DP group，可以继续看 kernel 性能。

lsaSize < dp_size:
  LSA 只覆盖部分 GPU。
  这个 LSA device-side 示例不能代表完整 DP all-reduce。
```

Then use CUDA event to test the device-side kernel separately:

```c
cudaEvent_t k_start, k_stop;
cudaEventCreate(&k_start);
cudaEventCreate(&k_stop);

cudaEventRecord(k_start, comm_stream);

dp_lsa_allreduce_kernel<float><<<nCTAs, 512, 0, comm_stream>>>(
    dev_dp_comm,
    grad_win,
    0,
    count
);

cudaEventRecord(k_stop, comm_stream);
cudaEventSynchronize(k_stop);

float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, k_start, k_stop);

double sec = kernel_ms / 1000.0;
double algbw = count * sizeof(float) / sec / 1e9;
double busbw = algbw * 2.0 * (dp_size - 1) / dp_size;

printf("device-side LSA allreduce: %.3f ms, algbw %.2f GB/s, busbw %.2f GB/s\n",
       kernel_ms, algbw, busbw);
```

If host-side NCCL is fast but the kernel is slow, focus on:

```text
nCTAs 是否太少。
block size 是否太小。
remote load/store 是否 coalesced。
每个 element 是否被所有 rank 反复 remote read，导致热点。
barrier 次数是否太多。
是否需要 chunking / vectorized load-store / pipeline。
是否应该改成 reduce-scatter + all-gather 风格，而不是每个 rank 读所有 peer。
```

The previous teaching kernel is to understand`ncclGetLsaPointer`and remote load/store, not the high-performance all-reduce algorithm. Really high-performance implementations typically do sharding, pipelines, coalesced access, and more granular synchronization.

#### 9.5 Quick positioning table

| Phenomenon | More likely problems | Next steps |
|---|---|---|
| host-side`ncclAllReduce`is slow,`p2pBandwidthLatencyTest`is also slow | GPU-GPU physical path / P2P / NVLink / PCIe / ACS / container topology | Prerequisite topology / driver / container exposure |
| host-side`ncclAllReduce`is slow, but bare P2P is fast | NCCL path selection / rank mapping / communicator group / env config | Look at NCCL log, check`cudaSetDevice`and rank mapping |
| Cross-machine host-side NCCL is slow,`ib_write_bw`is also slow | Network/RDMA itself is slow | Check IB/RoCE/NIC/PFC/routing |
| Cross-machine`ib_write_bw`is fast, but NCCL is slow | NCCL did not select the correct NIC / did not use GDRDMA / GPU-NIC affinity is poor | Check`NET/IB`vs`NET/Socket`, specify`NCCL_IB_HCA`|
| host-side NCCL is fast, device-side kernel is slow | self-written Device API algorithm is slow | check`nCTAs`, access pattern, barrier, chunking |
|`devComm.lsaSize < dp_size`| LSA peer domain is smaller than DP group | Cannot use this LSA example to do complete DP all-reduce |

The shortest conclusion:

```text
先用 dp_comm 测 host-side NCCL 上限。
再用 NCCL log 判断 transport。
再用 p2pBandwidthLatencyTest / ib_write_bw 分离物理链路问题。
最后才调 device-side kernel。
```


## Correspondence in LLM Training

| LLM parallelism | Common communications | Typical tools | Main issues |
| -------------------------------- | ------------------------------------------------ | -------------------------------- | ------------------------------------------------ |
| Data parallel | all-reduce, reduce-scatter | NCCL, FSDP, DeepSpeed | gradient synchronization, overlap, optimizer state memory |
| Tensor parallel | all-gather, all-reduce, reduce-scatter | NCCL, Megatron-LM | Each layer communicates frequently, latency and bandwidth are important |
| Pipeline parallel | send/recv activation and gradient | NCCL p2p, torch distributed | pipeline bubble, stage imbalance, sequence matching |
| FSDP / ZeRO | parameter all-gather, gradient reduce-scatter | PyTorch FSDP, DeepSpeed ZeRO | Save memory but increase communication |
| Expert parallel / MoE | all-to-all, token dispatch/combine | NCCL, NVSHMEM, custom kernels | small messages, dynamic routing, fine-grained sync |
| Context/sequence parallel | all-to-all, ring exchange, KV exchange | NCCL/custom kernels | activation/KV communication under long context |

## How to choose a tool

### Cases where NCCL is preferred

Use standard collective:

```text
all-reduce
reduce-scatter
all-gather
broadcast
all-to-all style pattern
```

Especially the main path of LLM training. NCCL's topology and protocol selections should be trusted by default.

### Cases where NVSHMEM is preferred

need:

```text
device-initiated communication
CUDA kernel 内 put/get
remote pointer
fused compute + communication
custom MoE dispatch/combine
细粒度或 irregular communication
```

NVSHMEM is more flexible, but requires more responsibility for synchronization and correctness.

### Cases where MPI is preferred

need:

```text
CPU/HPC portability
复杂 datatype
tags / wildcard
已有 MPI codebase
非 GPU 专用系统
```

MPI is strong, but complex semantics on the GPU will affect the optimization space.

### Custom communication situations

Only if you know for sure:

```text
topology
message sizes
repeat pattern
latency/bandwidth bottleneck
NCCL/NVSHMEM 默认行为的问题
```

It's worth handwriting schedule, multi-stream, custom kernel or specialized protocol.

## Checking questions while studying

When seeing a multi-GPU communication design, ask:

1. Is this a two-sided, one-sided, collective, or direct pointer?
2. Who knows the remote address?
3. Does the remote end have to post receive at the same time?
4. Are synchronization and data movement tied together, or separate?
5. Is the communication persistent?
6. Can the setup cost be amortized?
7. Should the data path take NVLink, PCIe, IB/RDMA, or host staging?
8. Is there any background progress?
9. Can GPU compute overlap with communication?
10. How much temporary memory is required?
11. Are small messages coalesced?
12. If a certain rank is slow, will everyone wait?
13. How to ensure completion before using the results?

## The most important mental model

```text
Two-sided:
  send/recv 匹配数据。
  问题是 matching 和等待。

One-sided:
  initiator 直接 put/get 远端 memory。
  问题是同步正确性。

Collective:
  一组 GPU 执行预定义全局操作。
  问题是所有 rank 顺序一致和 straggler。

Direct pointer:
  远端 GPU memory 像 pointer 一样访问。
  问题是 topology、coalescing 和 memory consistency。
```

The core takeaway of this session:

```text
LLM/HPC 通信模式高度重复。
重复模式应该提前 setup、预先调度、减少动态 matching。
标准 collective 用 NCCL。
需要 GPU kernel 内自定义通信或 remote memory access 时看 NVSHMEM。
```

## MoE dispatch/combine: How to choose NVSHMEM, NCCL Device API, DeepEP/HybridEP

MoE dispatch/combine is different from ordinary dense collective. Its core is dynamic token routing:

```text
token -> router -> top-k experts -> 对应 expert 所在 GPU/rank
```

So its problem is not simple:

```text
所有 rank 对同一个 dense buffer 做 all-reduce
```

And more like:

```text
每个 token 的 destination 不同。
每个 expert 收到的 token 数不同。
每一轮 token count / offset / layout 可能变化。
需要 pack / send / unpack / combine。
需要 metadata、counter、signal、completion protocol。
```

### Why NVSHMEM is often more natural than`ncclGetLsaPointer`

`ncclGetLsaPointer`is an LSA pointer accessor in the NCCL Device API:

```text
给我 LSA peer 的 registered window pointer。
```

What it solves is:

```text
我如何在 CUDA kernel 里拿到某个 LSA peer 的 remote-accessible pointer？
```

But it itself is not responsible for:

```text
token count exchange
expert load imbalance
variable-size all-to-all
远端 offset 分配
completion signal
combine 时按原 token order 写回
top-k 输出累加
跨 node routing
```

NVSHMEM's model is closer to MoE dispatch/combine:

```text
symmetric memory:
  每个 PE/GPU 都有对应 buffer。

put/get:
  GPU kernel 可以主动读写远端 PE。

signal/wait/atomic:
  可以表达 token 到达、count 更新、completion 等 protocol。

one-sided:
  发送方可以主动把 token 写到远端 expert buffer，
  远端不一定要同时 post receive。
```

Therefore, if you write the irregular MoE prototype yourself, NVSHMEM's abstraction is usually more natural:

```text
router 决定 token -> expert。
kernel 根据 expert 找 target PE。
kernel 把 token 写到 target PE 的 symmetric buffer。
用 signal/counter 告诉 target PE 有多少 token 到了。
combine 阶段再按 routing metadata 把 expert output 写回原 rank。
```

`ncclGetLsaPointer`is more like the underlying building block, suitable for research:

```text
NCCL Device API
LSA remote pointer
NCCL communicator/window/team/barrier
```

But it's not a full MoE dispatcher.

### What does DeepEP use?

DeepEP needs to be looked at in different versions.

DeepEP V1：

```text
NVSHMEM-based。
依赖 NVSHMEM 做 GPU-driven / one-sided communication。
这就是为什么 DeepEP V1 很适合作为 NVSHMEM 做 MoE dispatch/combine 的例子。
```

DeepEP V2：

```text
切到 NCCL GIN backend。
目标是更轻量，并且复用已有 NCCL communicators。
它不是简单地裸用 ncclGetLsaPointer，
而是使用 NCCL Device API 里更偏 GPU-initiated networking 的 GIN path。
```

so:

```text
DeepEP V1:
  NVSHMEM backend。

DeepEP V2:
  NCCL GIN backend。
```

### What does HybridEP use?

HybridEP is NVIDIA's dedicated MoE EP dispatcher. It's not plain NCCL all-to-all, nor is it bare NVSHMEM.

It's more like:

```text
custom CUDA kernels
registered buffers
intra-node NVLink / NVSwitch
inter-node RDMA
IBGDA
TMA
warp-group pipeline
```

The idea is:

```text
把 dispatch/combine 拆成细粒度 pipeline。
每个 CUDA block 可以作为一个独立 data channel。
不同 warp groups 负责不同阶段：
  RDMA
  GPU-to-shared-memory
  shared-memory-to-GPU
  local reduction / combine
```

The core goals of HybridEP are:

```text
接近硬件带宽上限。
减少 SM 占用。
让通信和 expert GEMM 更好 overlap。
支持 intra-node 和 Multi-Node NVLink / RDMA 混合场景。
```

### Actual choice

| Scenario | Suggestions |
|-------------------------------------------------------- | ------------------------------------------ |
| Normal MoE baseline | NCCL all-to-all / PyTorch`all_to_all`|
| Megatron-Core Standard EP |`--moe-token-dispatcher-type alltoall`|
| Large-scale cross-node fine-grained MoE | DeepEP |
| NVIDIA Megatron-Core + intra-node / MNNVL / GB200/NVL system | HybridEP |
| Research irregular one-sided MoE kernel by yourself | NVSHMEM |
| Research NCCL Device API / LSA pointer |`ncclGetLsaPointer`|
| Research GPU-initiated networking path | NCCL GIN |

The shortest conclusion:

```text
生产 MoE dispatch/combine:
  先看 DeepEP / HybridEP。

baseline:
  用 NCCL all-to-all。

自己写 one-sided irregular prototype:
  NVSHMEM 更自然。

研究 NCCL Device API:
  看 ncclGetLsaPointer / LSA / GIN。
```

### Practical Guide: Single-node NVSHMEM/NVLink MoE dispatch simplified version

The following paragraph is from a local demo:

```text
outputs/moe_dispatch_nvshmem_single_node_deepep_like.cu
```

It is not a DeepEP replacement, but a minimal practical mental model: in a single node NVLink/NVSwitch, use NVSHMEM symmetric memory and`nvshmem_ptr()`to do one-sided dispatch/combine in the GPU kernel.

Basic assumptions:

```text
one process per GPU / one NVSHMEM PE per GPU
所有 PE 都在同一个 node
GPU 之间有 NVLink/NVSwitch P2P 可达
num_experts % num_pes == 0
每个 expert 只属于一个 PE
```

Start method:

```bash
nvcc -O3 -std=c++17 -arch=sm_90 \
  -I${NVSHMEM_HOME}/include -L${NVSHMEM_HOME}/lib \
  moe_dispatch_nvshmem_single_node_deepep_like.cu \
  -lnvshmem_host -lnvshmem_device \
  -o moe_dispatch_nvshmem_single_node

nvshmrun -np 8 ./moe_dispatch_nvshmem_single_node 128 256 16 2
```

If you connect to PyTorch training, you usually do not use`nvshmrun`to run the standalone binary directly, but instead:

```text
torchrun 启动每个 GPU 一个 Python process
rank 0 生成 NVSHMEM unique ID
torch.distributed broadcast 这个 unique ID
C++/CUDA extension 用 nvshmemx_init_attr(... UNIQUEID ...) 初始化 NVSHMEM
```

#### 1. Core buffer and metadata

dispatch not only moves`x`, but also must save the reverse path required by combine.

Minimum handle:

```cpp
struct EpHandle {
  int* src_pe;              // recv slot 来自哪个 PE
  int* src_token;           // recv slot 对应源 rank 的哪个 token
  int* recv_topk_idx;       // [max_recv_tokens, topk], 本地 expert id 或 -1
  float* recv_topk_w;       // [max_recv_tokens, topk], 被 mask 的位置为 0
  uint64_t* recv_count;     // 当前 PE 实际收到多少个 dedup token
  uint64_t* recv_count_per_expert;
  int* overflow;
};
```

Key points:

```text
recv_x:
  当前 PE 收到的 token payload。

src_pe/src_token:
  combine 时把 expert output 写回原 rank、原 token。

recv_topk_idx/recv_topk_w:
  一个 token 可能 top-k 里有多个 expert。
  dispatch 到某个 rank 时，只保留属于该 rank 的 local expert，
  不属于该 rank 的 top-k entry 设成 -1 / 0。

recv_count:
  dispatch 时远端 atomic_fetch_add 分配 slot。
```

These receive/combine buffers need to be allocated using`nvshmem_malloc`because the remote GPU kernel accesses them through NVSHMEM.

```cpp
float* recv_x = (float*)nvshmem_malloc(max_recv_tokens * hidden * sizeof(float));
float* combined_contrib =
    (float*)nvshmem_malloc(local_tokens * topk * hidden * sizeof(float));

handle.src_pe = (int*)nvshmem_malloc(max_recv_tokens * sizeof(int));
handle.src_token = (int*)nvshmem_malloc(max_recv_tokens * sizeof(int));
handle.recv_count = (uint64_t*)nvshmem_malloc(sizeof(uint64_t));
```

Ordinary local input, such as`tokens/topk_idx/topk_weight`of the current rank, can use ordinary`cudaMalloc`or PyTorch tensor, because they are only read by this rank.

#### 2. Layout: First calculate which ranks the token will be sent to.

An important semantics of DeepEP is that normal dispatch only sends tokens once to the same destination rank. That is to say, if there are two experts in the top-k of a token, both of which are in rank 3, the token payload is only sent to rank 3 once, and then expanded locally in rank 3 using top-k metadata.

Simplified logic:

```cpp
for token in local_tokens:
  for k in topk:
    expert = topk_idx[token, k]
    dst_pe = expert / experts_per_pe

    num_tokens_per_expert[expert]++

    if this is the first top-k entry of this token going to dst_pe:
      num_tokens_per_rank[dst_pe]++
      is_token_in_rank[token, dst_pe] = true
```

The obtained`is_token_in_rank[token, dst_pe]`is the work table of dispatch kernel.

#### 3. Dispatch: GPU kernel directly writes to the remote receive buffer

Each`(token, dst_pe)`starts a block. If`is_token_in_rank[token, dst_pe] == false`, return directly.

The most important part:

```cpp
slot = nvshmem_uint64_atomic_fetch_add(handle.recv_count, 1, dst_pe);

float* peer_recv_x =
    (float*)nvshmem_ptr(recv_x, dst_pe);
int* peer_src_pe =
    (int*)nvshmem_ptr(handle.src_pe, dst_pe);
int* peer_src_token =
    (int*)nvshmem_ptr(handle.src_token, dst_pe);
int* peer_topk_idx =
    (int*)nvshmem_ptr(handle.recv_topk_idx, dst_pe);
float* peer_topk_w =
    (float*)nvshmem_ptr(handle.recv_topk_w, dst_pe);

peer_recv_x[slot, h] = tokens[token, h];
peer_src_pe[slot] = my_pe;
peer_src_token[slot] = token;
```

`nvshmem_ptr()`is the core of this single-node NVLink version: if the peer GPU P2P is reachable, it returns a peer pointer that this GPU kernel can directly load/store. In this way, the dispatch hot path does not require CPU intervention, nor does the receiver need to post receive first.

Then write masked top-k metadata:

```cpp
for k in topk:
  expert = topk_idx[token, k]
  if owner(expert) == dst_pe:
    peer_topk_idx[slot, k] = expert - dst_pe * experts_per_pe
    peer_topk_w[slot, k] = topk_weight[token, k]
    atomic_add(peer_recv_count_per_expert[local_expert])
  else:
    peer_topk_idx[slot, k] = -1
    peer_topk_w[slot, k] = 0
```

What to do after dispatch:

```cpp
nvshmemx_quiet_on_stream(stream);
nvshmemx_barrier_all_on_stream(stream);
```

`quiet`ensures that the NVSHMEM sent by this PE is written to completion; barrier ensures that all PEs complete dispatch, so that the subsequent local expert can read the complete receive buffer.

#### 4. Local expert: only process the local expert entry of this rank

What the receiver sees is:

```text
recv_x[slot]
recv_topk_idx[slot, k]
recv_topk_w[slot, k]
```

If`recv_topk_idx[slot, k] < 0`, it means that this top-k entry does not belong to this rank and is skipped.

This step in a real MoE should be grouped GEMM/expert MLP. The demo is just dummy transform:

```cpp
expert_out[slot, k, h] = recv_x[slot, h] * expert_scale;
```

#### 5. Combine: press handle to write back to the source rank

combine in turn uses`src_pe/src_token`to find the original token.

Key logic:

```cpp
src_pe = handle.src_pe[slot];
src_token = handle.src_token[slot];
weight = handle.recv_topk_w[slot, k];

float* peer_combined_contrib =
    (float*)nvshmem_ptr(combined_contrib, src_pe);

peer_combined_contrib[src_token, k, h] =
    expert_out[slot, k, h] * weight;
```

Then source rank local reduce top-k:

```cpp
combined_x[token, h] = sum_k combined_contrib[token, k, h];
```

So the complete data flow is:

```text
source rank:
  x[token] + topk_idx/topk_weight

dispatch:
  x[token] -> target rank recv_x[slot]
  metadata -> target rank handle[slot]

target rank:
  local expert computes expert_out[slot, k]

combine:
  expert_out[slot, k] * weight -> source rank combined_contrib[token, k]

source rank:
  sum over k -> combined_x[token]
```

#### 6. What should I add when taking PyTorch/MoE training?

The standalone demo only demonstrates communication semantics. These layers need to be supplemented during training:

```text
Python:
  router/topk
  local expert modules 或 grouped GEMM
  torch.autograd.Function 包住 dispatch/combine

C++/CUDA extension:
  workspace 持有 nvshmem_malloc 的 symmetric buffers
  forward dispatch/combine 调 NVSHMEM kernels
  backward 走反向路由：
    combine_backward: grad_combined_x -> grad_expert_out
    dispatch_backward: grad_recv_x -> grad_x

torchrun:
  用 torch.distributed 做 rank/world/local_rank
  用 broadcast 分发 NVSHMEM unique ID
```

Training version minimum forward:

```python
topk_score, topk_idx = torch.topk(router(x), topk, dim=-1)
topk_weight = torch.softmax(topk_score.float(), dim=-1)

recv_x, recv_topk_idx, recv_topk_weight, handle = dispatch(
    x, topk_idx, topk_weight
)

expert_out = local_experts(recv_x, recv_topk_idx)
y = combine(expert_out, handle)
loss = criterion(y, target)
loss.backward()
```

#### 7. The most common mistake

```text
1. 这是 single-node NVLink/NVSwitch direct pointer path。
   nvshmem_ptr() 返回 NULL 通常说明 peer 不可直接访问。

2. slot 分配必须是远端 atomic。
   多个 source PE 会同时往同一个 target PE 写 token。

3. dispatch 是 token-deduplicated per destination rank。
   不能简单按 top-k entry 发，否则同一个 token 会被重复发送到同一 rank。

4. combine 必须保存 src_pe/src_token。
   没有这两个 metadata，就不知道 expert output 应该回到哪个原 token。

5. top-k metadata 要 mask 成 local expert id。
   receiver rank 只应该看到自己的 local expert。

6. 每轮 dispatch/combine 前要清空 counter 和 output buffer。
   否则上一轮 token 会污染下一轮。

7. `quiet + barrier` 是 correctness baseline。
   高性能实现会做更细的 signal、queue、channel、pipeline 和 overlap。

8. 这个 demo 用 float32 和 naive copy。
   DeepEP/HybridEP 会用 BF16/FP8、vectorized copy、TMA、channel queue、SM tuning。
```

refer to:

- DeepEP README: <https://github.com/deepseek-ai/DeepEP>
- DeepEP V1 legacy docs: <https://github.com/deepseek-ai/DeepEP/blob/main/docs/legacy.md>
- Megatron-Core MoE docs: <https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/api-guide/moe.html>
- NVIDIA HybridEP blog: <https://developer.nvidia.com/blog/optimizing-communication-for-mixture-of-experts-training-with-hybrid-expert-parallel/>