This page provides an overview of the PyTorch Distributed Benchmarks, which are designed to measure the performance of distributed training on various hardware configurations.
Key Features
- Scalability: Benchmark results showcase the scalability of PyTorch Distributed across different numbers of GPUs.
- Efficiency: Insights into the efficiency of data parallelism and model parallelism.
- Hardware Compatibility: Performance data for various hardware setups, including CPUs, GPUs, and TPUs.
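The hardware-compatibility bullet above covers CPUs, GPUs, and TPUs; in PyTorch, the communication backend usually follows the hardware. The sketch below shows the common convention (NCCL for NVIDIA GPUs, Gloo for CPU-only runs); it is an illustrative assumption, not the benchmarks' exact setup, and TPU runs typically go through the separate torch_xla package rather than a torch.distributed backend.

```python
# Illustrative sketch (not the benchmarks' exact configuration):
# choose a torch.distributed backend based on the available hardware.
import torch
import torch.distributed as dist

def init_distributed() -> str:
    # NCCL is the usual choice for NVIDIA GPUs; Gloo covers CPU-only runs.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are assumed to be
    # provided by a launcher such as torchrun.
    dist.init_process_group(backend=backend)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} using {backend}")
    return backend
```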
Benchmark Results
Here are some of the key results from the benchmarks:
Data Parallelism:
- Training throughput scales near-linearly with the number of GPUs.
- Per-iteration latency stays low even at high GPU counts (see the DistributedDataParallel sketch after this list).
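As a rough illustration of how data-parallel numbers like these are typically produced, here is a minimal DistributedDataParallel training step. The model, batch size, and torchrun-style launch are assumptions made for the sketch, not the benchmark's actual configuration; the gradient all-reduce that runs during backward() is what the throughput and latency figures exercise.

```python
# Minimal data-parallel training step with DistributedDataParallel.
# Model, sizes, and launch method are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = nn.Linear(1024, 1024).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One training step: gradients are all-reduced across GPUs in backward().
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randn(32, 1024, device=device)
    loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 train.py`, each process drives one GPU and the same script scales from a single machine to multiple nodes.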
Model Parallelism:
- The benchmarks demonstrate training of models whose parameters exceed the memory capacity of a single GPU (a minimal model-parallel sketch follows this list).
- Performance is tuned for different model architectures.
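The general technique behind these results is to place different parts of the model on different devices and move activations between them during the forward pass. Below is a minimal two-GPU sketch under that assumption; the layer sizes and two-way split are made up for illustration, and the benchmarks' actual models and partitioning may differ.

```python
# Minimal manual model parallelism: two halves of a network on two GPUs,
# with activations moved between devices in forward(). Sizes are illustrative.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The first half lives on GPU 0 and the second half on GPU 1, so the
        # combined parameters can exceed a single GPU's memory.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move intermediate activations to the second GPU.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))  # output tensor lives on cuda:1
```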
System Configuration
The benchmarks were conducted on the following system configurations:
- CPU: Intel Xeon Gold 6242
- GPU: NVIDIA Tesla V100
- TPU: Google Cloud TPU v3-8
Learn More
For a detailed analysis of the benchmarks, see the PyTorch Distributed Benchmark Report.