This page provides an overview of the PyTorch Distributed Benchmarks, which are designed to measure the performance of distributed training on various hardware configurations.

Key Features

  • Scalability: Benchmark results showcase the scalability of PyTorch Distributed across different numbers of GPUs.
  • Efficiency: Insight into how efficiently data-parallel and model-parallel training use the available hardware.
  • Hardware Compatibility: Performance data for various hardware setups, including CPUs, GPUs, and TPUs.

Benchmark Results

Here are some of the key results from the benchmarks:

  • Data Parallelism (a throughput-measurement sketch follows this list):

    • Training throughput scales near-linearly with the number of GPUs.
    • Per-iteration latency remains low even at large GPU counts.
  • Model Parallelism (a device-placement sketch also follows this list):

    • The benchmark demonstrates training of models whose parameters exceed the memory capacity of a single GPU.
    • Performance is tuned for a range of model types.
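
The data-parallel results come from training loops wrapped in torch.nn.parallel.DistributedDataParallel. The snippet below is a minimal sketch of how such a throughput measurement can be set up; the model, batch size, iteration count, and torchrun launch are illustrative assumptions, not the benchmark's exact configuration.

```python
# Minimal sketch of a data-parallel throughput measurement with
# DistributedDataParallel. Model, batch size, and iteration count are
# illustrative placeholders, not the benchmark's actual configuration.
import os
import time

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    batch_size, iters = 64, 100
    inputs = torch.randn(batch_size, 1024, device=local_rank)
    targets = torch.randint(0, 10, (batch_size,), device=local_rank)

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Aggregate throughput = samples processed by all ranks per second.
    samples_per_sec = batch_size * iters * dist.get_world_size() / elapsed
    if dist.get_rank() == 0:
        print(f"throughput: {samples_per_sec:.1f} samples/s")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 ddp_throughput.py (the file name is just an example), each process drives one GPU and the all-reduce performed during backward() keeps the replicas synchronized.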
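
For model parallelism, layers are placed on different devices so that a model too large for one GPU can be split across several. The sketch below shows the basic pattern on two GPUs; the layer sizes and device ids are placeholder assumptions, and the models used in the benchmark are far larger.

```python
# Minimal sketch of model parallelism: two halves of a model placed on
# separate GPUs, with activations moved between devices in forward().
# Assumes at least two CUDA devices; sizes are illustrative only.
import torch
import torch.nn as nn


class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half lives on GPU 0, second half on GPU 1, so the
        # combined parameters never have to fit on a single device.
        self.part1 = nn.Sequential(nn.Linear(1024, 8192), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(8192, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move intermediate activations to the device holding part2.
        return self.part2(x.to("cuda:1"))


model = TwoGPUModel()
out = model(torch.randn(32, 1024))
print(out.shape, out.device)  # torch.Size([32, 10]) cuda:1
```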

System Configuration

The benchmarks were conducted on the following system configurations:

  • CPU: Intel Xeon Gold 6242
  • GPU: NVIDIA Tesla V100
  • TPU: Google Cloud TPU v3-8
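
When reproducing the benchmarks on your own hardware, it can help to record the devices PyTorch actually sees. The short snippet below is an illustrative way to do that; it is not part of the benchmark suite, and it only covers CUDA devices (TPU detection goes through torch_xla and is not shown).

```python
# Print the host CPU and the CUDA devices visible to PyTorch, so that
# benchmark numbers can be tied to a concrete hardware configuration.
import platform

import torch

print("CPU:", platform.processor())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))
else:
    print("No CUDA devices visible.")
```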

Learn More

For a detailed analysis of the benchmarks, check out the PyTorch Distributed Benchmark Report.
