GPU Support#

Overview#

PyLops-MPI supports computations on GPUs by leveraging the GPU backend of PyLops. Under the hood, CuPy (cupy-cudaXX>=v13.0.0) is used to perform all operations. This library must be installed before PyLops-MPI is installed.

Note

Set the environment variable CUPY_PYLOPS=0 to force PyLops to ignore the cupy backend. This can also be used if a previous (or faulty) version of cupy is installed on your system; otherwise, you will get an error when importing PyLops.
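
If you prefer not to export the variable from your shell, it can also be set from Python before PyLops is imported; a minimal sketch, assuming the variable is read at import time:

import os

# force PyLops to ignore any installed cupy and fall back to numpy
os.environ["CUPY_PYLOPS"] = "0"

import pylops  # the cupy backend is now disabled for this process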

The pylops_mpi.DistributedArray and pylops_mpi.StackedDistributedArray objects can be created from both numpy- and cupy-based local arrays, and all of the operators and solvers in PyLops-MPI can handle both scenarios. Note that, since most operators in PyLops-MPI are thin wrappers around PyLops operators, any PyLops operator that lacks a GPU implementation cannot be used in PyLops-MPI either when working with cupy arrays.

Note

By default, when using cupy arrays, PyLops-MPI will try to use methods in MPI4Py that communicate memory buffers. However, this requires a CUDA-Aware MPI installation. If your MPI installation is not CUDA-Aware, set the environment variable PYLOPS_MPI_CUDA_AWARE=0 to force PyLops-MPI to use methods in MPI4Py that communicate general Python objects (this will incur a loss of performance!).

Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly optimized collective operations, such as AllReduce and AllGather. This allows PyLops-MPI users to leverage proprietary technologies like NVLink that might be available in their infrastructure for fast data communication.

Note

Set the environment variable NCCL_PYLOPS_MPI=0 to explicitly force PyLops-MPI to ignore the NCCL backend. However, this is optional, as users may opt out of NCCL simply by not passing a cupy.cuda.nccl.NcclCommunicator to pylops_mpi.DistributedArray.

In summary:

| Operation model       | Enabled with                      | Disabled with                     |
|-----------------------|-----------------------------------|-----------------------------------|
| NumPy + MPI           | Default                           | Cannot be disabled                |
| CuPy + MPI            | PYLOPS_MPI_CUDA_AWARE=0           | PYLOPS_MPI_CUDA_AWARE=1 (default) |
| CuPy + CUDA-Aware MPI | PYLOPS_MPI_CUDA_AWARE=1 (default) | PYLOPS_MPI_CUDA_AWARE=0           |
| CuPy + NCCL           | NCCL_PYLOPS_MPI=1 (default)       | NCCL_PYLOPS_MPI=0                 |
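
The CuPy-based operation models can likewise be selected by setting the corresponding environment variables before pylops_mpi is imported; a minimal sketch, assuming both flags are read at import time:

import os

# fall back to object-based MPI communication if MPI is not CUDA-Aware
os.environ["PYLOPS_MPI_CUDA_AWARE"] = "0"

# ignore the NCCL backend even if it is available
os.environ["NCCL_PYLOPS_MPI"] = "0"

import pylops_mpi  # runs with the CuPy + MPI (non CUDA-Aware) operation model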

Example#

Let's briefly look at an example. First, we write a code snippet using numpy arrays, which PyLops-MPI will run on your CPU:

# Imports
import numpy as np
import pylops
import pylops_mpi
from mpi4py import MPI

# MPI helpers
comm = MPI.COMM_WORLD
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="numpy", dtype=dtype)
d_dist[:] = np.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(np.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist
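
To inspect or collect the result, the asarray method of pylops_mpi.DistributedArray gathers the global array on every rank; a short sketch based on the snippet above:

# gather the result of the distributed computation as a single NumPy array
y = y_dist.asarray()
print(f"Rank {rank}: y.shape = {y.shape}, dtype = {y.dtype}")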

Now we write the same code snippet using cupy arrays, which PyLops-MPI will run on your GPU:

# Imports
import cupy as cp

# MPI helpers
comm = MPI.COMM_WORLD
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

# Define gpu to use
cp.cuda.Device(device=rank).use()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="cupy", dtype=dtype)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist

The code is almost unchanged: apart from the fact that we now use cupy arrays, PyLops-MPI will figure out the rest!
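
If the result is needed back on the host (e.g., for plotting or saving to disk), it can be moved explicitly with CuPy; a short sketch, assuming y_dist is the distributed result from the snippet above:

# gather the distributed result: with engine="cupy" this is a CuPy array on the GPU
y = y_dist.asarray()

# move it back to host memory as a NumPy array
y_host = cp.asnumpy(y)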

Finally, if NCCL is available, a cupy.cuda.nccl.NcclCommunicator can be initialized and passed to pylops_mpi.DistributedArray as follows:

from pylops_mpi.utils._nccl import initialize_nccl_comm

# Initialize NCCL Communicator
nccl_comm = initialize_nccl_comm()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     base_comm_nccl=nccl_comm,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="cupy", dtype=dtype)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist

Under the hood, PyLops-MPI uses both the MPI and NCCL communicators to manage distributed operations. Each GPU is logically bound to one MPI process. Minor communications, such as those dealing with array shapes and sizes, are still performed using MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
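
This binding is what the earlier call to cp.cuda.Device(device=rank).use() establishes; a minimal sketch for a node where the number of ranks may exceed the number of GPUs (the round-robin assignment here is an illustrative assumption, not a PyLops-MPI requirement):

import cupy as cp
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# bind this MPI process to one GPU (round-robin if more ranks than GPUs)
ngpus = cp.cuda.runtime.getDeviceCount()
cp.cuda.Device(rank % ngpus).use()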

Note

The CuPy and NCCL backends are in active development, and many examples are not yet available in the documentation. You can find several other examples in the PyLops Notebooks repository.

Support for CuPy and NCCL#

In the following, we provide a list of modules (i.e., operators and solvers) with their current status (available on CPU+MPI, GPU+MPI, and GPU+NCCL):