Validate & Run the First Training Job

The cluster is built. The switches are configured. The hosts are configured. Pods can start.

Do not run a real training job yet. First, validate. The cost of a 256-GPU job hitting a misconfiguration is hours of compute time burned for nothing. The validation pyramid takes ~30 minutes and saves days.


The validation pyramid

Validate bottom-up. Each layer depends on the one below.

         ┌─────────────────────┐
         │  Real training job  │  Layer 6
         └─────────────────────┘
       ┌─────────────────────────┐
       │  nccl-tests AllReduce   │  Layer 5
       └─────────────────────────┘
     ┌─────────────────────────────┐
     │  ib_write_bw (pod-to-pod)   │  Layer 4
     └─────────────────────────────┘
   ┌─────────────────────────────────┐
   │   PFC counters · ECN marking    │  Layer 3
   └─────────────────────────────────┘
 ┌─────────────────────────────────────┐
 │       BGP up · ECMP balanced        │  Layer 2
 └─────────────────────────────────────┘
┌───────────────────────────────────────┐
│ Links up · optics healthy · cables OK │  Layer 1
└───────────────────────────────────────┘

If a layer fails, fix it before moving up. Don't skip a layer "because it's probably fine."


Layer 1: physical — links & optics

# On every switch
show interface status
show interface counters errors

# Look for:
# - All training-fabric ports "connected" at 400G (not 200G or down)
# - Zero CRC errors, zero input/output errors in steady state
# - Optical levels in spec (typically -3 to -7 dBm RX)

What good looks like:

  • 100% of ports up at the rated speed
  • Zero CRC errors after a 5-minute soak
  • Optical RX power within vendor-specified range

Common failures:

  • One port "connected" at 100G instead of 400G — usually a transceiver mismatch or a damaged cable
  • Persistent CRC errors — bad optic or cable; replace and retest
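Eyeballing counters works for one switch, not forty. Below is a minimal sketch of an automated check; the `check_errors` name and the four-column input format (port, CRC, input errors, output errors) are assumptions for illustration, since the real `show interface counters errors` layout varies by vendor:

```shell
# Sketch: flag any port with a nonzero error counter.
# Assumes a whitespace-separated dump: <port> <crc> <in-errs> <out-errs>
# (adjust the awk field numbers to match your NOS's actual columns).
check_errors() {
  awk 'NR > 1 && ($2 + $3 + $4) > 0 { print $1; bad = 1 }
       END { exit bad }'
}
```

Pipe the counter dump in; it prints the dirty ports and exits nonzero, which makes it easy to gate a soak test in automation.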

Layer 2: control plane — BGP & ECMP

show ip bgp summary
show ip route bgp
show ip route ecmp

# Look for:
# - Every spine is an established BGP neighbor on every leaf
# - Every leaf prefix has 4 ECMP next-hops (one per spine)

Pinging across the fabric (server to server, not RDMA — just IP):

# From server-01 to server-32, across all 8 rails
for i in 0 1 2 3 4 5 6 7; do
  ping -c 3 -I 10.5${i}.0.1 10.5${i}.0.32
done

What good looks like:

  • All pings succeed
  • Latency consistent (a few microseconds within a rack, tens of microseconds across the fabric)
  • No packet loss

Common failures:

  • One rail's ping fails — check that pod's IP plan, check the NAD config
  • All pings fail — BGP isn't fully converged, or ECMP isn't installing routes
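The ECMP check is also scriptable. A sketch assuming Linux-style FRR/SONiC output, where each ECMP member appears as a `nexthop via ...` line; `ecmp_paths` is an illustrative name, not a real tool:

```shell
# Sketch: count ECMP next-hops for a prefix from `ip route show` output.
# Assumes each installed path is printed on its own "nexthop via ..." line.
ecmp_paths() {
  grep -c 'via'
}

# On a leaf, every remote-leaf prefix should report one path per spine:
#   n=$(ip route show 10.50.0.0/24 | ecmp_paths)
#   [ "$n" -eq 4 ] || echo "WARN: expected 4 spine paths, saw $n"
```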

Layer 3: QoS — PFC & ECN counters

Capture a baseline (no load):

show priority-flow-control counters
show qos counters

Baseline expectations: counters near zero. PFC RX/TX should be 0. ECN marks should be 0.

Now induce congestion. From several servers at once, fire ib_write_bw at a single target server:

# On 8 source servers (parallel):
ib_write_bw -d mlx5_0 --duration 30 --rate_limit_type=PP \
  --bw_limit=40 <target-server-ip> &

Watch the counters on the target's leaf switch by re-running the two show commands above while the load runs.

What good looks like:

  • ECN marking activates (ECN marked counter rises)
  • DCQCN on senders dials back rates (visible via the NIC's rp_cnp_handled counter)
  • PFC fires only briefly, if at all (PFC TX counter increments slowly, then stops)
  • No drops (out_of_buffer_discards stays 0)

Common failures:

  • PFC fires constantly, ECN never marks — min_th too high. Tune down.
  • ECN marks but DCQCN doesn't react — DCQCN not enabled on the NIC. Check /sys/class/net/.../ecn/roce_rp/enable/3.
  • Drops happen — headroom underprovisioned or wrong PFC priority. Recheck switch config.
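Raw counter values mean little on their own; the deltas under load are what matter. Here is a sketch for diffing two snapshots, assuming you save each as sorted "name value" pairs (the `counter_delta` helper is mine, not a vendor command):

```shell
# Sketch: print every counter that moved between two snapshots.
# Both files must be "name value" pairs, sorted by name (join requires it).
counter_delta() {
  join "$1" "$2" | awk '$2 != $3 { print $1, $3 - $2 }'
}

# Capture before and after the induced load, then diff:
#   show priority-flow-control counters > pfc.before   (repeat to pfc.after)
#   counter_delta pfc.before pfc.after
```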

Layer 4: RDMA verbs — ib_write_bw pod-to-pod

Launch two test pods on different servers:

kubectl apply -f rdma-test-server.yaml
kubectl apply -f rdma-test-client.yaml

Each test pod has the same Multus + SR-IOV config as a real training pod but runs perftest binaries.

# In the server pod:
ib_write_bw -d mlx5_0 -s 65536 -F --report_gbits

# In the client pod (after server is listening):
ib_write_bw -d mlx5_0 -s 65536 -F --report_gbits <server-pod-ip-on-rail-0>

What good looks like:

Message size    Expected throughput @ 400G NIC
1 KB            ~3 Gbps (limited by message rate)
64 KB           ~280 Gbps
1 MB+           350-380 Gbps (~95% of line rate)

If you don't hit 95% on large messages, something is wrong — bad cable, bad QoS, or NIC misconfig.

Repeat across all 8 rails to confirm each independently:

for i in 0 1 2 3 4 5 6 7; do
  ib_write_bw -d mlx5_${i} -s 1048576 -F --report_gbits <peer-pod>
done

All 8 should hit ~370+ Gbps.
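To make the all-rails sweep self-checking, you can parse the perftest summary line. The field position below assumes the usual `#bytes  #iterations  BW peak  BW average  MsgRate` layout, which can shift between perftest versions, and `rail_ok` is an illustrative name:

```shell
# Sketch: exit 0 iff the last ib_write_bw summary row's average BW
# (field 4 in the usual --report_gbits layout) meets a floor in Gb/s.
rail_ok() {
  awk -v min="$1" '/^[ ]*[0-9]/ { bw = $4 }
                   END { exit !(bw >= min) }'
}

# Usage:
#   ib_write_bw -d mlx5_3 -s 1048576 -F --report_gbits <peer-pod> \
#     | rail_ok 350 || echo "rail 3 below 350 Gbps"
```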


Layer 5: NCCL — AllReduce throughput

nccl-tests is the gold standard for "does the whole stack work?"

git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1

Run on 2-server / 16-GPU first:

mpirun -np 16 -H server-01:8,server-02:8 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  -x NCCL_IB_GID_INDEX=3 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1

What good looks like (H100, 8 NICs × 400G per server):

Message size    Expected algbw (algorithm bandwidth)
1 MB            ~50 GB/s
64 MB           ~200 GB/s
1 GB            ~300+ GB/s
4 GB            ~330 GB/s

If you're seeing roughly half these numbers, the usual culprit is that GPUDirect RDMA isn't active:

lsmod | grep nvidia_peermem # must be loaded

Other typical issues:

  • NCCL using only 1 NIC → NCCL_IB_HCA env var missing or wrong
  • "WARN: skipping ..." in NCCL_DEBUG output → check the warning, usually a topology file issue

Once 2-server works, scale up: 4 servers, 8 servers, full pod. Throughput should stay close to per-server numbers (slight degradation due to AllReduce overhead).
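One subtlety when comparing runs at different scales: nccl-tests reports both algbw and busbw, and for AllReduce the two are related by a factor of 2(N-1)/N, where N is the total GPU count (this is the formula from the nccl-tests performance docs; the helper name below is mine). busbw is the number that should stay roughly flat as you scale out:

```shell
# busbw = algbw * 2 * (N - 1) / N for AllReduce (nccl-tests convention).
allreduce_busbw() {
  # args: <algbw in GB/s> <total GPU count>
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.1f\n", a * 2 * (n - 1) / n }'
}

# 16 GPUs at 300 GB/s algbw corresponds to 562.5 GB/s busbw:
#   allreduce_busbw 300 16
```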


Layer 6: a real training step

The capstone test: run a real training framework, measure step time.

For a 175B-parameter model on 256 GPUs (32 servers × 8 GPUs):

# Using NVIDIA's Megatron-LM reference benchmark
docker run --gpus all -it \
  nvcr.io/nvidia/pytorch:24.10-py3 bash

# Inside container:
cd /workspace/Megatron-LM
torchrun --nproc_per_node=8 --nnodes=32 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 4 \
  --num-layers 96 --hidden-size 12288 \
  --seq-length 2048 --global-batch-size 1536 \
  --train-iters 100

What good looks like:

  • Step time stable across iterations (variance < 5%)
  • Step time within ~10% of theoretical (depends on model, framework efficiency)
  • No NCCL timeouts
  • Per-rank progress logged consistently

Capture a baseline. The first 100 training steps from a clean cluster build = your reference for everything that follows.
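The "variance < 5%" criterion is easy to check mechanically if your framework logs one step time per line. A sketch (the `step_time_stable` name and the one-number-per-line input format are assumptions about your logging):

```shell
# Sketch: read step times (seconds, one per line); print the mean and
# exit nonzero if any step deviates from the mean by more than 5%.
step_time_stable() {
  awk '{ t[NR] = $1; sum += $1 }
       END {
         mean = sum / NR
         for (i = 1; i <= NR; i++) {
           d = t[i] - mean; if (d < 0) d = -d
           if (d / mean > 0.05) bad = 1
         }
         printf "mean=%.3fs\n", mean
         exit bad
       }'
}
```

A failing check usually means one slow rank, which the cheat sheet in the next section helps localize.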


When validation fails, where to look

A symptom → likely cause cheat sheet:

Validation step   Symptom                          First place to check
Layer 1           Port at 100G instead of 400G     Transceiver mismatch (NIC vs switch optic)
Layer 2           ECMP shows 1 path instead of 4   BGP maximum-paths config on the leaf
Layer 3           Drops despite PFC                Headroom underprovisioned (cable length mismatch)
Layer 4           ib_write_bw low                  DCQCN not enabled, or wrong DSCP / SL config
Layer 5           NCCL slow                        GPUDirect not active (nvidia_peermem missing)
Layer 6           Step time varies wildly          One slow rank → check NUMA, check rail health

The "go live" gate

Before declaring the cluster production-ready:

✓ Layer 1 — 100% ports up, 0 CRC errors after 1-hour soak
✓ Layer 2 — All BGP up, ECMP balanced, full mesh ping pass
✓ Layer 3 — PFC + ECN baselines captured, ECN responds to induced load
✓ Layer 4 — ib_write_bw hits >350 Gbps on all 8 rails
✓ Layer 5 — nccl-tests AllReduce within 15% of expected at full pod scale
✓ Layer 6 — A real training job ran for >24 hours with stable step time
✓ Telemetry plumbing — Grafana dashboards capture the five golden signals

Skip any check at your peril.


What you should remember

  • Validate bottom-up. Each layer depends on the one below.
  • Capture a baseline before going live. Without it, you can't tell "broken" from "normal."
  • The most common bug is GPUDirect not active — easy to miss, halves throughput.
  • ib_write_bw proves the wire works. nccl-tests proves the stack works. A real training job proves they work together.
  • 30 minutes of validation saves days of debugging when a real job starts misbehaving.

You're done. Cluster is built, configured, and validated. Head back to the curriculum index, or jump to Production Operations to set up the monitoring and incident response that keeps this running.