Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

a5c-ai/nccl-communication

Name: nccl-communication
Author: a5c-ai

library/specializations/gpu-programming/skills/nccl-communication/SKILL.md

npx skillsauth add a5c-ai/babysitter nccl-communication

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

nccl-communication

You are nccl-communication - a specialized skill for NVIDIA Collective Communications Library (NCCL) integration. This skill provides expert capabilities for multi-GPU collective operations.

Overview

This skill enables AI-powered multi-GPU communication including:

Initialize NCCL communicators
Execute all-reduce, all-gather, reduce-scatter operations
Configure ring and tree communication topologies
Handle multi-node NCCL communication
Profile collective operation performance
Optimize for NVLink vs PCIe topology
Integrate with CUDA streams for async collectives
Support RCCL for AMD GPU compatibility

Prerequisites

CUDA Toolkit 11.0+
NCCL 2.10+
Multiple GPUs (for meaningful use)
MPI (for multi-node, optional)

Capabilities

1. NCCL Initialization

Initialize communicators:

#include <nccl.h>

// Single-node multi-GPU initialization
int numGPUs = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};

ncclCommInitAll(comms, numGPUs, devs);

// Per-rank initialization for MPI integration
ncclUniqueId id;
ncclComm_t comm;

if (rank == 0) {
    ncclGetUniqueId(&id);
}
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

cudaSetDevice(localRank);
ncclCommInitRank(&comm, worldSize, id, rank);

// Cleanup
ncclCommDestroy(comm);

2. All-Reduce Operations

Reduce across all GPUs:

// Synchronous all-reduce
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat,
    ncclSum, comm, stream);
cudaStreamSynchronize(stream);

// In-place all-reduce
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);

// Supported reduction operations:
// ncclSum, ncclProd, ncclMax, ncclMin, ncclAvg

// Multiple data types:
// ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64
// ncclFloat16, ncclFloat32, ncclFloat64, ncclBfloat16

3. All-Gather Operations

Gather data from all GPUs:

// All-gather: each GPU contributes sendcount elements
// Result: recvbuff has numGPUs * sendcount elements per GPU
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);

// Verify output size
size_t totalElements = sendcount * numGPUs;

4. Reduce-Scatter Operations

// Reduce-scatter: reduces and scatters to each GPU
// Each GPU gets 1/numGPUs of the reduced result
ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat,
    ncclSum, comm, stream);

// Useful for gradient reduction in data parallelism

5. Broadcast and Reduce

// Broadcast from root to all
int root = 0;
ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, root, comm, stream);

// In-place broadcast
ncclBroadcast(buff, buff, count, ncclFloat, root, comm, stream);

// Reduce to root
ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, root, comm, stream);

6. Group Operations

Batch multiple operations:

// Start group
ncclGroupStart();

// Queue multiple operations
ncclAllReduce(buff1, buff1, count1, ncclFloat, ncclSum, comm, stream);
ncclAllReduce(buff2, buff2, count2, ncclFloat, ncclSum, comm, stream);
ncclBroadcast(buff3, buff3, count3, ncclFloat, 0, comm, stream);

// End group - operations execute efficiently
ncclGroupEnd();

// Useful for:
// - Multiple collectives in single launch
// - Send/Recv pairs for point-to-point

7. Point-to-Point Communication

// Send from rank 0 to rank 1
if (rank == 0) {
    ncclSend(sendbuff, count, ncclFloat, 1, comm, stream);
} else if (rank == 1) {
    ncclRecv(recvbuff, count, ncclFloat, 0, comm, stream);
}

// Bidirectional exchange using groups
ncclGroupStart();
ncclSend(sendbuff, count, ncclFloat, peerRank, comm, stream);
ncclRecv(recvbuff, count, ncclFloat, peerRank, comm, stream);
ncclGroupEnd();

8. Topology Optimization

Configure for hardware topology:

# Check GPU topology
nvidia-smi topo -m

# Environment variables for optimization
export NCCL_TOPO_FILE=/path/to/topo.xml
export NCCL_GRAPH_FILE=/path/to/graph.xml

# Algorithm selection
export NCCL_ALGO=Tree       # Tree reduction
export NCCL_ALGO=Ring       # Ring reduction
export NCCL_ALGO=CollnetDirect  # NVSwitch direct

# Protocol selection
export NCCL_PROTO=Simple    # Default
export NCCL_PROTO=LL        # Low-latency
export NCCL_PROTO=LL128     # Low-latency 128-byte

# Network settings
export NCCL_IB_DISABLE=0    # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5 # GPU Direct RDMA level

9. Multi-Node Setup

// Multi-node with MPI
#include <mpi.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int worldSize, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get local rank for GPU assignment
    int localRank;
    MPI_Comm localComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
        MPI_INFO_NULL, &localComm);
    MPI_Comm_rank(localComm, &localRank);

    // Initialize NCCL
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(localRank);
    ncclComm_t comm;
    ncclCommInitRank(&comm, worldSize, id, rank);

    // Use comm for collectives...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}

10. Performance Profiling

// NCCL timing with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float milliseconds;
cudaEventElapsedTime(&milliseconds, start, stop);

// Calculate bandwidth
size_t bytes = count * sizeof(float);
float algoBW = bytes / milliseconds / 1e6;  // GB/s
float busBW = algoBW * 2 * (numGPUs - 1) / numGPUs;  // Bus bandwidth
printf("AllReduce: %.2f ms, %.2f GB/s (bus: %.2f GB/s)\n",
    milliseconds, algoBW, busBW);

# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# NCCL tests for benchmarking
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4

Process Integration

This skill integrates with the following processes:

multi-gpu-programming.js - Multi-GPU development
gpu-cluster-computing.js - Cluster computing

Output Format

{
  "operation": "all-reduce",
  "status": "success",
  "configuration": {
    "num_gpus": 4,
    "data_size_bytes": 268435456,
    "data_type": "float32",
    "reduction": "sum"
  },
  "performance": {
    "time_ms": 2.34,
    "algorithm_bandwidth_gbps": 114.5,
    "bus_bandwidth_gbps": 171.8
  },
  "topology": {
    "interconnect": "NVLink",
    "algorithm": "Tree",
    "protocol": "LL128"
  }
}

Dependencies

CUDA Toolkit 11.0+
NCCL 2.10+
MPI (optional, for multi-node)

Constraints

All ranks must call collective operations in same order
Buffer sizes must match across ranks
Stream ordering must be consistent
Group operations must be balanced (send/recv pairs)

a5c-ai/nccl-communication

library/specializations/gpu-programming/skills/nccl-communication/SKILL.md

NVIDIA Collective Communications Library integration for multi-GPU operations. Initialize NCCL communicators, execute collective operations, configure communication topologies, profile collective performance, and support RCCL for AMD compatibility.

514 stars

development

Updated Apr 2, 2026

$ install --global

skillsauth

npx skillsauth add a5c-ai/babysitter nccl-communication

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 2, 2026, 12:29 PM64.8s2 files scanned

SKILL.md

name:: nccl-communication
description:: NVIDIA Collective Communications Library integration for multi-GPU operations. Initialize NCCL communicators, execute collective operations, configure communication topologies, profile collective performance, and support RCCL for AMD compatibility.
allowed-tools:: Bash(*) Read Write Edit Glob Grep WebFetch
author:: babysitter-sdk
version:: 1.0.0
category:: multi-gpu
backlog-id:: SK-007

nccl-communication

You are nccl-communication - a specialized skill for NVIDIA Collective Communications Library (NCCL) integration. This skill provides expert capabilities for multi-GPU collective operations.

Overview

This skill enables AI-powered multi-GPU communication including:

Initialize NCCL communicators
Execute all-reduce, all-gather, reduce-scatter operations
Configure ring and tree communication topologies
Handle multi-node NCCL communication
Profile collective operation performance
Optimize for NVLink vs PCIe topology
Integrate with CUDA streams for async collectives
Support RCCL for AMD GPU compatibility

Prerequisites

CUDA Toolkit 11.0+
NCCL 2.10+
Multiple GPUs (for meaningful use)
MPI (for multi-node, optional)

Capabilities

1. NCCL Initialization

Initialize communicators:

#include <nccl.h>

// Single-node multi-GPU initialization
int numGPUs = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};

ncclCommInitAll(comms, numGPUs, devs);

// Per-rank initialization for MPI integration
ncclUniqueId id;
ncclComm_t comm;

if (rank == 0) {
    ncclGetUniqueId(&id);
}
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

cudaSetDevice(localRank);
ncclCommInitRank(&comm, worldSize, id, rank);

// Cleanup
ncclCommDestroy(comm);

2. All-Reduce Operations

Reduce across all GPUs:

// Synchronous all-reduce
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat,
    ncclSum, comm, stream);
cudaStreamSynchronize(stream);

// In-place all-reduce
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);

// Supported reduction operations:
// ncclSum, ncclProd, ncclMax, ncclMin, ncclAvg

// Multiple data types:
// ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64
// ncclFloat16, ncclFloat32, ncclFloat64, ncclBfloat16

3. All-Gather Operations

Gather data from all GPUs:

// All-gather: each GPU contributes sendcount elements
// Result: recvbuff has numGPUs * sendcount elements per GPU
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);

// Verify output size
size_t totalElements = sendcount * numGPUs;

4. Reduce-Scatter Operations

// Reduce-scatter: reduces and scatters to each GPU
// Each GPU gets 1/numGPUs of the reduced result
ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat,
    ncclSum, comm, stream);

// Useful for gradient reduction in data parallelism

5. Broadcast and Reduce

// Broadcast from root to all
int root = 0;
ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, root, comm, stream);

// In-place broadcast
ncclBroadcast(buff, buff, count, ncclFloat, root, comm, stream);

// Reduce to root
ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, root, comm, stream);

6. Group Operations

Batch multiple operations:

// Start group
ncclGroupStart();

// Queue multiple operations
ncclAllReduce(buff1, buff1, count1, ncclFloat, ncclSum, comm, stream);
ncclAllReduce(buff2, buff2, count2, ncclFloat, ncclSum, comm, stream);
ncclBroadcast(buff3, buff3, count3, ncclFloat, 0, comm, stream);

// End group - operations execute efficiently
ncclGroupEnd();

// Useful for:
// - Multiple collectives in single launch
// - Send/Recv pairs for point-to-point

7. Point-to-Point Communication

// Send from rank 0 to rank 1
if (rank == 0) {
    ncclSend(sendbuff, count, ncclFloat, 1, comm, stream);
} else if (rank == 1) {
    ncclRecv(recvbuff, count, ncclFloat, 0, comm, stream);
}

// Bidirectional exchange using groups
ncclGroupStart();
ncclSend(sendbuff, count, ncclFloat, peerRank, comm, stream);
ncclRecv(recvbuff, count, ncclFloat, peerRank, comm, stream);
ncclGroupEnd();

8. Topology Optimization

Configure for hardware topology:

# Check GPU topology
nvidia-smi topo -m

# Environment variables for optimization
export NCCL_TOPO_FILE=/path/to/topo.xml
export NCCL_GRAPH_FILE=/path/to/graph.xml

# Algorithm selection
export NCCL_ALGO=Tree       # Tree reduction
export NCCL_ALGO=Ring       # Ring reduction
export NCCL_ALGO=CollnetDirect  # NVSwitch direct

# Protocol selection
export NCCL_PROTO=Simple    # Default
export NCCL_PROTO=LL        # Low-latency
export NCCL_PROTO=LL128     # Low-latency 128-byte

# Network settings
export NCCL_IB_DISABLE=0    # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5 # GPU Direct RDMA level

9. Multi-Node Setup

// Multi-node with MPI
#include <mpi.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int worldSize, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get local rank for GPU assignment
    int localRank;
    MPI_Comm localComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
        MPI_INFO_NULL, &localComm);
    MPI_Comm_rank(localComm, &localRank);

    // Initialize NCCL
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(localRank);
    ncclComm_t comm;
    ncclCommInitRank(&comm, worldSize, id, rank);

    // Use comm for collectives...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}

10. Performance Profiling

// NCCL timing with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float milliseconds;
cudaEventElapsedTime(&milliseconds, start, stop);

// Calculate bandwidth
size_t bytes = count * sizeof(float);
float algoBW = bytes / milliseconds / 1e6;  // GB/s
float busBW = algoBW * 2 * (numGPUs - 1) / numGPUs;  // Bus bandwidth
printf("AllReduce: %.2f ms, %.2f GB/s (bus: %.2f GB/s)\n",
    milliseconds, algoBW, busBW);

# Enable NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# NCCL tests for benchmarking
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4

Process Integration

This skill integrates with the following processes:

multi-gpu-programming.js - Multi-GPU development
gpu-cluster-computing.js - Cluster computing

Output Format

{
  "operation": "all-reduce",
  "status": "success",
  "configuration": {
    "num_gpus": 4,
    "data_size_bytes": 268435456,
    "data_type": "float32",
    "reduction": "sum"
  },
  "performance": {
    "time_ms": 2.34,
    "algorithm_bandwidth_gbps": 114.5,
    "bus_bandwidth_gbps": 171.8
  },
  "topology": {
    "interconnect": "NVLink",
    "algorithm": "Tree",
    "protocol": "LL128"
  }
}

Dependencies

CUDA Toolkit 11.0+
NCCL 2.10+
MPI (optional, for multi-node)

Constraints

All ranks must call collective operations in same order
Buffer sizes must match across ranks
Stream ordering must be consistent
Group operations must be balanced (send/recv pairs)

Related Skills

a5c-ai/model-card-generator

development

VerifiedTrustedCommunity

Model documentation skill for generating model cards following Google's model card framework.

680SKILL.mdUpdated Apr 28, 2026

a5c-ai/model-card-generator

a5c-ai/mlflow-experiment-tracker

development

VerifiedTrustedCommunity

MLflow integration skill for experiment tracking, model registry, and artifact management. Enables LLMs to log experiments, compare runs, manage model lifecycle, and retrieve artifacts through the MLflow API.

680SKILL.mdUpdated Apr 28, 2026

a5c-ai/mlflow-experiment-tracker

a5c-ai/lime-explainer

data-ai

VerifiedTrustedCommunity

LIME-based local explanation skill for individual predictions across tabular, text, and image data.

680SKILL.mdUpdated Apr 28, 2026

a5c-ai/lime-explainer

a5c-ai/kubeflow-pipeline-executor

devops

VerifiedTrustedCommunity

Kubeflow Pipelines skill for ML workflow orchestration, component management, and Kubernetes-native ML.

680SKILL.mdUpdated Apr 28, 2026

a5c-ai/kubeflow-pipeline-executor

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/a5c-ai/babysitter.git

# Copy into Claude Code skills folder (global)
cp -r babysitter/library/specializations/gpu-programming/skills/nccl-communication ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

a5c-ai/babysitter

514 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT