Horovod is an open-source distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It scales a single-GPU training script to multiple GPUs on one machine or across multiple machines. This document provides an overview of the cluster architecture required for Horovod.
Overview
A Horovod cluster typically consists of the following components:
- Computational Nodes: These are the machines that perform the actual computation. Each node can have one or more GPUs.
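- Master Node: This is the node from which the training job is launched. It starts the training processes on the worker nodes and often runs training processes itself. It does not hand out tasks or collect results; the workers average gradients among themselves using allreduce.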
- Network: The network connects all the nodes in the cluster. It should be high-speed and low-latency to ensure efficient communication.
Components
Computational Nodes
Computational nodes are the backbone of the Horovod cluster. They are responsible for executing the training tasks. Each node should have the following components:
- GPU: Horovod typically runs one training process per GPU, so a node with several GPUs hosts several processes, each pinned to its own device.
- CPU: The CPU runs the Horovod runtime, the input pipeline, and other system processes.
- Memory: GPU memory must hold the model, its gradients, and intermediate results such as activations. Host memory must be sufficient for data loading and preprocessing.
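With one process per GPU, each process is identified by a global rank and by a local rank within its node. The following is a minimal sketch of that mapping, assuming ranks are handed out host by host in the order the hosts are listed. The host names and the helper `assign_ranks` are hypothetical and are not part of Horovod.

```python
from typing import Dict, List, Tuple


def assign_ranks(hosts: List[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Map each GPU slot to a (rank, local_rank, host) record.

    `hosts` is a list of (hostname, gpu_count) pairs. Ranks are assigned
    host by host, which is an assumption made for this illustration.
    """
    assignments = []
    rank = 0
    for host, gpu_count in hosts:
        for local_rank in range(gpu_count):
            assignments.append(
                {"rank": rank, "local_rank": local_rank, "host": host}
            )
            rank += 1
    return assignments


if __name__ == "__main__":
    # Two hypothetical nodes with two GPUs each give four processes.
    for record in assign_ranks([("node-a", 2), ("node-b", 2)]):
        print(record)
```

In a real training script, the local rank (available as `hvd.local_rank()` after `hvd.init()`) is what a process would use to select its GPU.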
Master Node
The master node is the machine from which the training job is launched. Horovod does not use a central parameter server, so the master node's role is limited to the following tasks:
- Process Launch: The master node starts one training process per GPU slot on each worker node, typically over SSH or MPI.
- Result Handling: The master node does not collect gradients. All workers average them collectively with allreduce, and by convention only the rank 0 process writes checkpoints and logs.
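In Horovod, gradient aggregation is performed with an allreduce across all workers rather than on a single collecting node. The following is a minimal single-process simulation of ring-allreduce averaging, intended only to illustrate the data flow. Real jobs delegate this step to MPI, Gloo, or NCCL, and the function name `ring_allreduce_average` is hypothetical.

```python
from typing import List


def ring_allreduce_average(gradients: List[List[float]]) -> List[List[float]]:
    """Simulate ring-allreduce averaging across workers in one process.

    gradients[i] is worker i's local gradient; all must have equal length.
    Returns every worker's buffer after the exchange.
    """
    n = len(gradients)
    length = len(gradients[0])
    buffers = [list(g) for g in gradients]
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]

    # Phase 1 (reduce-scatter): each worker passes one chunk to its right
    # neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = []
        for i in range(n):
            chunk_id = (i - step) % n
            lo, hi = bounds[chunk_id]
            outgoing.append((chunk_id, buffers[i][lo:hi]))
        for i in range(n):
            chunk_id, data = outgoing[i]
            lo, _ = bounds[chunk_id]
            dest = buffers[(i + 1) % n]
            for k, value in enumerate(data):
                dest[lo + k] += value

    # Phase 2 (allgather): each fully summed chunk travels around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        outgoing = []
        for i in range(n):
            chunk_id = (i + 1 - step) % n
            lo, hi = bounds[chunk_id]
            outgoing.append((chunk_id, buffers[i][lo:hi]))
        for i in range(n):
            chunk_id, data = outgoing[i]
            lo, hi = bounds[chunk_id]
            buffers[(i + 1) % n][lo:hi] = data

    # Every worker now holds the same sums; divide to get the average.
    return [[value / n for value in buf] for buf in buffers]


if __name__ == "__main__":
    print(ring_allreduce_average([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Every worker ends up with the same averaged gradient, and no node ever has to receive the full gradient from all the others at once.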
Network
The network connects all the nodes in the cluster. It should meet the following criteria:
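- High-Speed: The network should have high bandwidth, because every training step exchanges gradient data that is roughly the size of the model.
- Low-Latency: The network should have low latency, because each gradient exchange involves many small messages between nodes and delays add up at every step.
- Reliability: The network should be reliable, because all workers take part in every gradient exchange and a lost connection usually stalls or aborts the whole job.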
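To make the bandwidth requirement concrete, the cost of one gradient exchange can be estimated. The sketch below assumes a ring-allreduce, in which each worker sends about 2 * (N - 1) / N times the gradient size per step, and it ignores latency and any overlap with computation. The function name and the example numbers are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(
    model_bytes: float, workers: int, bandwidth_bytes_per_s: float
) -> float:
    """Estimate the bandwidth-bound time of one ring-allreduce.

    Each worker sends (and receives) 2 * (workers - 1) / workers * model_bytes.
    Latency and overlap with computation are ignored.
    """
    if workers < 2:
        return 0.0  # A single worker has nothing to exchange.
    traffic = 2 * (workers - 1) / workers * model_bytes
    return traffic / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed example: 100 million float32 parameters, 8 workers.
    model_bytes = 100e6 * 4
    for label, bandwidth in [("10 Gbit/s", 1.25e9), ("100 Gbit/s", 12.5e9)]:
        seconds = ring_allreduce_seconds(model_bytes, 8, bandwidth)
        print(f"{label}: {seconds:.3f} s per exchange")
```

Under these assumptions, a 10 Gbit/s link spends roughly 0.56 seconds per step on communication alone, which is why slower networks can dominate the training time of large models.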
Setup
To set up a Horovod cluster, you need to perform the following steps:
- Install Horovod: Install Horovod, together with the same version of your deep learning framework, on all the nodes in the cluster.
- Configure Network: Make sure every node can reach the others, and that the master node can log in to each worker node over SSH without a password prompt.
- Start Training: Run horovodrun once on the master node with the list of worker hosts. It starts the training processes on the worker nodes, so no separate start command is needed on each worker.
Example
Here's an example of how to run a training script on two worker nodes, launched from one master node. The script name train.py is a placeholder:
# Install Horovod on all nodes (a deep learning framework must already be installed)
pip install horovod
# Configure network
# (This step depends on your specific network setup; the master node needs
# passwordless SSH access to every worker node)
# Launch from the master node: 2 processes in total, 1 slot on each worker node
horovodrun -np 2 -H <worker-node-1-ip>:1,<worker-node-2-ip>:1 python train.py
# Worker nodes need no separate start command; horovodrun starts their processes over SSH
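Horovod jobs are normally launched with the horovodrun command, which takes the total process count (-np) and a host list (-H host:slots). When the host list comes from an inventory or a scheduler, the command line can be assembled programmatically. The following is a minimal sketch; the helper `build_horovodrun_command`, the addresses, and the script name are hypothetical.

```python
import shlex
from typing import List, Tuple


def build_horovodrun_command(
    hosts: List[Tuple[str, int]], script: str
) -> List[str]:
    """Build a horovodrun argument list from (host, slots) pairs.

    The total process count passed to -np is the sum of all slots.
    """
    total_slots = sum(slots for _, slots in hosts)
    host_spec = ",".join(f"{host}:{slots}" for host, slots in hosts)
    return [
        "horovodrun",
        "-np", str(total_slots),
        "-H", host_spec,
        "python", script,
    ]


if __name__ == "__main__":
    # Placeholder addresses and script name.
    command = build_horovodrun_command(
        [("10.0.0.1", 1), ("10.0.0.2", 1)], "train.py"
    )
    # Print a shell-safe command line (shlex.join needs Python 3.8+).
    print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run` on the master node, which avoids shell quoting issues.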
For more detailed instructions, please refer to the Horovod documentation.