Setting up MPI (Message Passing Interface) is essential for distributed training with Horovod. Below are key steps to configure MPI on your system:

1. Install an MPI Implementation

  • OpenMPI:
    sudo apt-get install openmpi-bin libopenmpi-dev
    
    📌 Visit our guide for detailed installation steps
  • MPICH:
    wget https://www.mpich.org/static/downloads/4.0.2/mpich-4.0.2.tar.gz
    tar -xzf mpich-4.0.2.tar.gz
    cd mpich-4.0.2
    ./configure --prefix=/usr/local
    make
    sudo make install
    

2. Verify MPI Installation

Run the following command to check if MPI is properly installed:

mpiexec --version

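✅ Expected output: a version banner naming your MPI implementation and its version, for example Open MPI 4.1.4 or MPICH 4.0.2 (the exact wording differs between implementations)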
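
The same check can be scripted, which is handy when you need to confirm that every node has a launcher on its PATH. The following is a minimal sketch, not part of Horovod or MPI: the function names classify_mpi and check_mpi are hypothetical, and the banner keywords it looks for ("Open MPI", "OpenRTE", "HYDRA") are assumptions about typical `mpiexec --version` output that may not match every release.

```python
import shutil
import subprocess


def classify_mpi(version_text):
    """Guess the MPI implementation from `mpiexec --version` output.

    The keywords are assumptions about common banners, not a guaranteed format.
    """
    text = version_text.lower()
    if "open mpi" in text or "openrte" in text or "open-mpi" in text:
        return "Open MPI"
    if "hydra" in text or "mpich" in text:
        return "MPICH"
    return "unknown"


def check_mpi(launcher="mpiexec"):
    """Return (path, implementation), or (None, None) if the launcher is not on PATH."""
    path = shutil.which(launcher)
    if path is None:
        return None, None
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    # Some launchers print the banner to stderr, so inspect both streams.
    return path, classify_mpi(result.stdout + result.stderr)


if __name__ == "__main__":
    path, flavor = check_mpi()
    if path is None:
        print("mpiexec not found on PATH")
    else:
        print(f"{path}: {flavor}")
```

If the script reports "unknown", fall back to reading the output of mpiexec --version directly.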

3. Configure Horovod with MPI

  • Set build-time environment variables (Horovod reads these while pip compiles it, so export them before installing):
    export HOROVOD_WITH_MPI=1
    export HOROVOD_GPU_ALLREDUCE=NCCL
    
  • Install Horovod:
    pip install --no-cache-dir horovod
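
After installation it is worth confirming that Horovod was actually compiled with MPI support. The sketch below is one way to do that from Python. It assumes the PyTorch binding (horovod.torch) is the one you installed, and the helper name horovod_build_report is hypothetical. It returns None instead of failing when Horovod cannot be imported.

```python
def horovod_build_report():
    """Return which backends Horovod was compiled with, or None if it cannot be imported."""
    try:
        # Assumption: the PyTorch binding is installed. Swap in
        # horovod.tensorflow if that is the framework you use.
        import horovod.torch as hvd
    except ImportError:
        return None
    return {
        "mpi": bool(hvd.mpi_built()),
        "nccl": bool(hvd.nccl_built()),
        "gloo": bool(hvd.gloo_built()),
    }


if __name__ == "__main__":
    report = horovod_build_report()
    if report is None:
        print("Horovod (PyTorch binding) is not importable")
    else:
        for backend, built in report.items():
            print(f"{backend}: {'yes' if built else 'no'}")
```

If "mpi" shows "no", reinstall Horovod after making sure mpicxx (or your MPI compiler wrapper) is on PATH.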
    

4. Run Distributed Training

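Use mpiexec to launch the training script. The command below starts four processes on the local machine; to span multiple nodes, also pass your MPI implementation's host list or hostfile option: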

mpiexec -n 4 python train_script.py
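
To extend the launch command to several machines, the process count has to match the slots offered by the hosts. The sketch below builds such a command line using Open MPI's -H host:slots syntax. The helper name build_mpiexec_command and the host names node1 and node2 are hypothetical, and MPICH uses different host options, so treat this as an illustration rather than a universal recipe.

```python
import shlex


def build_mpiexec_command(script, hosts):
    """Build an Open MPI style launch command.

    hosts maps each host name to the number of processes (slots) to start there.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    total = sum(hosts.values())
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["mpiexec", "-n", str(total), "-H", host_spec, "python", script]


if __name__ == "__main__":
    # node1 and node2 are placeholder host names.
    cmd = build_mpiexec_command("train_script.py", {"node1": 2, "node2": 2})
    print(shlex.join(cmd))
    # prints: mpiexec -n 4 -H node1:2,node2:2 python train_script.py
```

Multi-node launches of this kind generally assume that the launching machine can reach every listed host over SSH without a password prompt and that the script exists at the same path on each node.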

🧠 For more examples, check Horovod's distributed training tutorial.


📌 *Figure: MPI setup architecture for multi-node training*