GPU Support#
Overview#
PyLops-MPI supports computations on GPUs by leveraging the GPU backend of PyLops. Under the hood, CuPy (cupy-cudaXX>=13.0.0) is used to perform all of the operations. This library must be installed before PyLops-MPI is installed.
Note
Set the environment variable CUPY_PYLOPS=0 to force PyLops to ignore the cupy backend. This can also be used if a previous (or faulty) version of cupy is installed in your system; otherwise, you will get an error when importing PyLops.
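For instance, a minimal sketch of disabling the cupy backend from within Python (assuming the variable is set before pylops and pylops_mpi are imported, since the backend is selected at import time) is:
import os

# Must be set before importing pylops/pylops_mpi,
# as the cupy backend is detected at import time
os.environ["CUPY_PYLOPS"] = "0"

import pylops_mpi  # now falls back to the numpy backend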
The pylops_mpi.DistributedArray and pylops_mpi.StackedDistributedArray objects can be created from both numpy- and cupy-based local arrays, and all of the operators and solvers in PyLops-MPI can handle both scenarios. Note that, since most operators in PyLops-MPI are thin wrappers around PyLops operators, any PyLops operator that lacks a GPU implementation cannot be used in PyLops-MPI either when working with cupy arrays.
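As a minimal sketch (using only the constructor arguments that also appear in the examples below), the engine parameter controls which array library backs the local arrays:
import numpy as np
import pylops_mpi

# engine="numpy" keeps the local arrays on the CPU as numpy arrays
x = pylops_mpi.DistributedArray(global_shape=100, engine="numpy", dtype=np.float32)
x[:] = np.ones(x.local_shape, dtype=np.float32)

# engine="cupy" would instead store cupy arrays on the GPU
# (requires a working CuPy installation; see the full example below)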
Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly optimized collective operations, such as AllReduce and AllGather. This allows PyLops-MPI users to leverage proprietary technologies like NVLink that might be available in their infrastructure for fast data communication.
Note
Set the environment variable NCCL_PYLOPS_MPI=0 to explicitly force PyLops-MPI to ignore the NCCL backend. However, this is optional, as users can also opt out of NCCL by simply not passing a cupy.cuda.nccl.NcclCommunicator to pylops_mpi.DistributedArray.
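A minimal sketch of the environment-variable route (assuming, as above, that it is set before pylops_mpi is imported):
import os

# Force PyLops-MPI to ignore NCCL even if cupy.cuda.nccl is available
os.environ["NCCL_PYLOPS_MPI"] = "0"

import pylops_mpi  # collective operations will go through MPI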
Example#
Finally, let's briefly look at an example. First, we write a code snippet using numpy arrays, which PyLops-MPI will run on your CPU:
import numpy as np
import pylops
import pylops_mpi
from mpi4py import MPI

# MPI helpers
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="numpy", dtype=dtype)
d_dist[:] = np.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(np.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist
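If needed, the distributed result can be gathered into a single NumPy array, replicated on every rank (a sketch assuming the asarray method of pylops_mpi.DistributedArray):
# Gather the distributed result into the global array
y = y_dist.asarray()
if rank == 0:
    print(y.shape)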
Now we write a code snippet using cupy arrays, which PyLops-MPI will run on your GPU:
import numpy as np
import cupy as cp
import pylops
import pylops_mpi
from mpi4py import MPI

# MPI helpers
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Define gpu to use
cp.cuda.Device(device=rank).use()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="cupy", dtype=dtype)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist
The code is almost unchanged apart from the fact that we now use cupy arrays; PyLops-MPI will figure this out!
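For example, the local chunk of the result now lives on the GPU and can be copied back to the host with CuPy's standard conversion (a sketch assuming the DistributedArray exposes its local chunk via the local_array property):
# The local portion of the distributed result is a cupy array ...
print(type(y_dist.local_array))
# ... and can be moved back to the host explicitly when needed
y_local_cpu = cp.asnumpy(y_dist.local_array)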
Finally, if NCCL is available, a cupy.cuda.nccl.NcclCommunicator can be initialized and passed to pylops_mpi.DistributedArray as follows:
import numpy as np
import cupy as cp
import pylops
import pylops_mpi
from pylops_mpi.utils._nccl import initialize_nccl_comm

# Initialize NCCL Communicator
nccl_comm = initialize_nccl_comm()

# Create distributed data (broadcast)
nxl, nt = 20, 20
dtype = np.float32
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
                                     base_comm_nccl=nccl_comm,
                                     partition=pylops_mpi.Partition.BROADCAST,
                                     engine="cupy", dtype=dtype)
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)

# Create and apply VStack operator
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
y_dist = HOp @ d_dist
Under the hood, PyLops-MPI uses both the MPI and NCCL communicators to manage distributed operations. Each GPU is logically bound to one MPI process. In fact, minor communications, like those dealing with array-related shapes and sizes, are still performed using MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
Note
The CuPy and NCCL backends are in active development, with many examples not yet in the docs. You can find many other examples in the PyLops Notebooks repository.
Support for the NCCL Backend#
In the following, we provide a list of modules (i.e., operators and solvers) for which we plan to support NCCL, together with the current status:
| modules | NCCL supported |
|---|---|
|  | / |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
|  | Ongoing |
| ISTA Solver | Planned |
| Complex Numeric Data Type for NCCL | Planned |