This page provides an overview of the PyTorch Distributed Benchmarks, which are designed to measure the performance of distributed training on various hardware configurations.
Key Features
- Scalability: Benchmark results showcase the scalability of PyTorch Distributed across different numbers of GPUs.
- Efficiency: Insights into the efficiency of data parallelism and model parallelism.
- Hardware Compatibility: Performance data for various hardware setups, including CPUs, GPUs, and TPUs.
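The hardware-compatibility bullet above covers CPUs, GPUs, and TPUs; in PyTorch, the communication backend usually follows the hardware. The sketch below shows the common convention (NCCL for NVIDIA GPUs, Gloo for CPU-only runs); it is an illustrative assumption, not the benchmarks' exact setup, and TPU runs typically go through the separate torch_xla package rather than a torch.distributed backend.

```python
# Illustrative sketch (not the benchmarks' exact configuration):
# choose a torch.distributed backend based on the available hardware.
import torch
import torch.distributed as dist

def init_distributed() -> str:
    # NCCL is the usual choice for NVIDIA GPUs; Gloo covers CPU-only runs.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are assumed to be
    # provided by a launcher such as torchrun.
    dist.init_process_group(backend=backend)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} using {backend}")
    return backend
```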
Benchmark Results
Here are some of the key results from the benchmarks:
Data Parallelism:
- Training throughput scales near-linearly with the number of GPUs.
- Per-iteration latency stays low even at high GPU counts (see the DistributedDataParallel sketch after this list).
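As a rough illustration of how data-parallel numbers like these are typically produced, here is a minimal DistributedDataParallel training step. The model, batch size, and torchrun-style launch are assumptions made for the sketch, not the benchmark's actual configuration; the gradient all-reduce that runs during backward() is what the throughput and latency figures exercise.

```python
# Minimal data-parallel training step with DistributedDataParallel.
# Model, sizes, and launch method are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = nn.Linear(1024, 1024).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One training step: gradients are all-reduced across GPUs in backward().
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randn(32, 1024, device=device)
    loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 train.py`, each process drives one GPU and the same script scales from a single machine to multiple nodes.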
Model Parallelism:
- The benchmarks demonstrate training of models whose parameters exceed the memory capacity of a single GPU (a minimal model-parallel sketch follows this list).
- Performance is tuned for different model architectures.
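The general technique behind these results is to place different parts of the model on different devices and move activations between them during the forward pass. Below is a minimal two-GPU sketch under that assumption; the layer sizes and two-way split are made up for illustration, and the benchmarks' actual models and partitioning may differ.

```python
# Minimal manual model parallelism: two halves of a network on two GPUs,
# with activations moved between devices in forward(). Sizes are illustrative.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The first half lives on GPU 0 and the second half on GPU 1, so the
        # combined parameters can exceed a single GPU's memory.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move intermediate activations to the second GPU.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))  # output tensor lives on cuda:1
```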
System Configuration
The benchmarks were conducted on the following system configurations:
- CPU: Intel Xeon Gold 6242
- GPU: NVIDIA Tesla V100
- TPU: Google Cloud TPU v3-8
Learn More
For a detailed analysis of the benchmarks, see the PyTorch Distributed Benchmark Report.