Horovod is a high-performance distributed deep learning training framework designed to scale TensorFlow, Keras, and PyTorch training to hundreds of GPUs. This guide explains the concepts of parallelism in Horovod.

Overview of Parallelism in Horovod

Horovod achieves parallelism by running multiple training processes on different GPUs or machines, each working on its own portion of the data, and then averaging their gradients to update the model. This shortens training time and makes it practical to train on larger datasets.

Types of Parallelism

  • Data Parallelism: Each process trains a complete copy of the model on a different subset of the data; this is the form of parallelism Horovod is built around (see the sketch after this list).
  • Model Parallelism: Different parts of the model are placed on different GPUs.
  • Pipeline Parallelism: Different stages of the computation are executed on different GPUs.
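
As a concrete illustration of data parallelism, the sketch below shards the training data so that each process sees a different subset. It is a minimal example and assumes an existing PyTorch Dataset named train_dataset:

import torch
import torch.utils.data.distributed
import horovod.torch as hvd

hvd.init()

# train_dataset is assumed to be an existing torch.utils.data.Dataset
# Give each process a distinct shard of the dataset, indexed by its rank
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=32, sampler=train_sampler)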

Setting Up Parallelism

To enable parallelism in Horovod, you launch one training process per GPU and tell the launcher how many processes to start; this is typically done with the horovodrun command line tool (for example, horovodrun -np 4 python train.py starts four processes). Inside the training script, each process initializes Horovod and can then query its place in the job:

import horovod.torch as hvd

# Initialize Horovod in every process
hvd.init()

# Query the job topology (these calls read values; they do not set them)
hvd.size()        # Total number of processes
hvd.rank()        # Global rank of this process
hvd.local_rank()  # Rank of this process on its own machine
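
A common next step, sketched below under the assumption that you are training on CUDA GPUs, is to pin each process to a single GPU using its local rank, so that processes on the same machine do not contend for the same device:

import torch
import horovod.torch as hvd

hvd.init()

# Pin this process to one GPU, chosen by its rank within the machine
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())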

Gradient Aggregation

Horovod uses a ring-allreduce algorithm to average gradients across all processes. Because every process receives the same averaged gradients, all model replicas apply identical updates and stay in sync.
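
The allreduce primitive can also be called directly. The minimal sketch below, which assumes Horovod has already been initialized, averages a one-element tensor across all processes:

import torch
import horovod.torch as hvd

hvd.init()

# Each process contributes its own value; allreduce returns the average over all ranks
local_value = torch.tensor([float(hvd.rank())])
averaged = hvd.allreduce(local_value, op=hvd.Average)

During training you normally do not call allreduce yourself: wrapping the optimizer with hvd.DistributedOptimizer averages the gradients automatically, as in the example below.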

import torch
import horovod.torch as hvd

# Define your model, loss function, and optimizer
model = ...
loss_fn = ...
optimizer = ...

# Wrap the optimizer so gradients are averaged across all processes
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every process from the same state as rank 0
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Forward pass
outputs = model(inputs)
loss = loss_fn(outputs, targets)

# Backward pass; Horovod all-reduces the gradients during this step
optimizer.zero_grad()
loss.backward()

# Update the model with the averaged gradients
optimizer.step()

Advanced Parallelism Techniques

  • Mixed Precision Training: Horovod supports mixed precision training, which can improve training speed and reduce memory usage.
  • Custom Gradient Aggregation: You can implement your own gradient aggregation logic, for example by all-reducing gradients manually instead of relying on hvd.DistributedOptimizer (see the sketch after this list).
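
The sketch below illustrates one way to do custom aggregation. It is a minimal example, not Horovod's built-in mechanism; model, loss_fn, inputs, and targets are assumed to exist already:

import torch
import horovod.torch as hvd

hvd.init()

# model, loss_fn, inputs, and targets are assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

outputs = model(inputs)
loss = loss_fn(outputs, targets)

optimizer.zero_grad()
loss.backward()

# Average every gradient across processes by hand before the update
for param in model.parameters():
    if param.grad is not None:
        param.grad = hvd.allreduce(param.grad, op=hvd.Average)

optimizer.step()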

Resources

For more information on Horovod and its features, please visit the Horovod documentation.


Horovod Architecture