Distributed Matrix Multiplication
This example shows how to use the pylops_mpi.basicoperators.MPIMatrixMult
operator to perform matrix-matrix multiplication between a matrix \(\mathbf{A}\)
blocked over rows (i.e., blocks of rows are stored over different ranks) and a
matrix \(\mathbf{X}\) blocked over columns (i.e., blocks of columns are
stored over different ranks), with an equal number of row and column blocks.
Similarly, the adjoint operation can be performed with a matrix \(\mathbf{Y}\)
blocked in the same fashion as matrix \(\mathbf{X}\).
Note that whilst the different blocks of the matrix \(\mathbf{A}\) are directly
stored in the operator on different ranks, the matrix \(\mathbf{X}\) is
effectively represented by a 1-D pylops_mpi.DistributedArray, where
the different blocks are flattened and stored on different ranks. To
optimize communications, the ranks are organized in a 2D grid and some of the
row blocks of \(\mathbf{A}\) and column blocks of \(\mathbf{X}\) are
replicated across different ranks (see below for details).
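Before involving MPI, the block structure described above can be illustrated serially. The following is a minimal NumPy sketch, not the operator's implementation: the sizes and the number of blocks (`n_blocks`) are hypothetical, and every block lives in a single process. It only shows that block \((i, j)\) of the product needs row block \(i\) of \(\mathbf{A}\) and column block \(j\) of \(\mathbf{X}\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))  # matrix to be blocked over rows
X = rng.random((3, 6))  # matrix to be blocked over columns

n_blocks = 2  # hypothetical: equal number of row and column blocks
A_blocks = np.array_split(A, n_blocks, axis=0)  # row blocks of A
X_blocks = np.array_split(X, n_blocks, axis=1)  # column blocks of X

# Block (i, j) of the product depends only on A_blocks[i] and X_blocks[j]
Y = np.block([[A_i @ X_j for X_j in X_blocks] for A_i in A_blocks])

# The assembled blocks agree with the plain matrix product
print(np.allclose(Y, A @ X))
```

In the distributed setting each of these small products is computed by a different rank, which is why a rank only needs some of the blocks of each matrix.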
from matplotlib import pyplot as plt
import math
import numpy as np
from mpi4py import MPI
from pylops_mpi import DistributedArray, Partition
from pylops_mpi.basicoperators.MatrixMult import MPIMatrixMult
plt.close("all")
We set the seed such that all processes create input matrices filled with the same random numbers. In practical applications, such matrices will be filled with data appropriate to the use case.
np.random.seed(42)
We are now ready to create the input matrices \(\mathbf{A}\) of size \(N \times K\) and \(\mathbf{X}\) of size \(K \times M\).
N, K, M = 4, 4, 4
A = np.random.rand(N * K).astype(dtype=np.float32).reshape(N, K)
X = np.random.rand(K * M).astype(dtype=np.float32).reshape(K, M)
The processes are now arranged in a \(P' \times P'\) grid, where \(P\) is the total number of processes.
We define
and the replication factor
Each process is therefore assigned a pair of coordinates \((r,c)\) within this grid:
For example, when \(P = 4\) we have \(P' = 2\), giving a 2×2 layout:
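As a rough illustration of such a grid, the sketch below enumerates coordinates for \(P = 4\) without using MPI. It rests on two assumptions that this page does not confirm: that the grid side is `math.isqrt(P)` (the expression applied to the communicator size later in this example) and that ranks are laid out in row-major order. The helper name `grid_coords` is hypothetical; in the example itself the coordinates are returned by `active_grid_comm`.

```python
import math


def grid_coords(rank, n_procs):
    """Return (r, c) for a rank, ASSUMING a row-major square grid."""
    p_prime = math.isqrt(n_procs)  # grid side (assumed definition)
    return rank // p_prime, rank % p_prime


P = 4
p_prime = math.isqrt(P)
for r in range(p_prime):
    # One printed line per grid row
    print("  ".join(f"rank {r * p_prime + c} -> (r={r}, c={c})"
                    for c in range(p_prime)))
```

Under these assumptions ranks 0 and 1 share the first grid row, while ranks 0 and 2 share the first grid column.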
The grid layout is obtained by invoking the
pylops_mpi.basicoperators.MPIMatrixMult.active_grid_comm
method, which is also
responsible for identifying any rank that should be deactivated (if the number
of rows of the operator or columns of the input/output matrices is smaller
than the number of row or column ranks).
base_comm = MPI.COMM_WORLD
comm, rank, row_id, col_id, is_active = MPIMatrixMult.active_grid_comm(base_comm, N, M)
print(f"Process {base_comm.Get_rank()} is {'active' if is_active else 'inactive'}")
if not is_active: exit(0)
# Create sub-communicators
p_prime = math.isqrt(comm.Get_size())
row_comm = comm.Split(color=row_id, key=col_id) # all procs in same row
col_comm = comm.Split(color=col_id, key=row_id) # all procs in same col
Process 0 is active
At this point we divide the rows and columns of \(\mathbf{A}\) and \(\mathbf{X}\), respectively, such that each rank ends up with:
\(A_{p} \in \mathbb{R}^{\text{my_own_rows}\times K}\)
\(X_{p} \in \mathbb{R}^{K\times \text{my_own_cols}}\)
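One way to carve out these blocks is sketched below in plain NumPy, so that it can be tried without MPI. The helper name `local_blocks` is hypothetical, and the choice that `col_id` selects the row block of \(\mathbf{A}\) while `row_id` selects the column block of \(\mathbf{X}\) is an assumption made for illustration. The ceil-division block size follows the `blk_cols` expression used in the verification code at the end of this example.

```python
import math
import numpy as np


def local_blocks(A, X, row_id, col_id, p_prime):
    """Return (A_p, X_p) for the rank at grid position (row_id, col_id).

    ASSUMPTION: col_id picks the row block of A, row_id picks the
    column block of X. Blocks have ceil(n / p_prime) rows or columns,
    with a shorter (possibly empty) last block.
    """
    N, M = A.shape[0], X.shape[1]
    blk_rows = math.ceil(N / p_prime)
    blk_cols = math.ceil(M / p_prime)

    rs = col_id * blk_rows        # first row owned by this rank
    re = min(N, rs + blk_rows)    # one past the last owned row
    cs = row_id * blk_cols        # first column owned by this rank
    ce = min(M, cs + blk_cols)    # one past the last owned column

    return A[rs:re, :].copy(), X[:, cs:ce].copy()


A_demo = np.arange(20.0).reshape(5, 4)
X_demo = np.arange(12.0).reshape(4, 3)
A_p, X_p = local_blocks(A_demo, X_demo, row_id=1, col_id=0, p_prime=2)
my_own_rows, my_own_cols = A_p.shape[0], X_p.shape[1]
print(A_p.shape, X_p.shape)
```

With 5 rows and 3 columns split over a 2×2 grid, the block sizes are uneven, which is why `my_own_rows` and `my_own_cols` are computed per rank rather than assumed constant.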
We are now ready to create the pylops_mpi.basicoperators.MPIMatrixMult
operator and the input matrix \(\mathbf{X}\)
Aop = MPIMatrixMult(A_p, M, base_comm=comm, dtype="float32")
col_lens = comm.allgather(my_own_cols)
total_cols = np.sum(col_lens)
x = DistributedArray(global_shape=K * total_cols,
                     local_shapes=[K * col_len for col_len in col_lens],
                     partition=Partition.SCATTER,
                     mask=[i % p_prime for i in range(comm.Get_size())],
                     base_comm=comm,
                     dtype="float32")
x[:] = X_p.flatten()
We can now apply the forward pass \(\mathbf{y} = \mathbf{Ax}\) (which effectively implements a distributed matrix-matrix multiplication \(\mathbf{Y} = \mathbf{AX}\)). Note that \(\mathbf{Y}\) is distributed in the same way as the input \(\mathbf{X}\).
y = Aop @ x
Next we apply the adjoint pass \(\mathbf{x}_{adj} = \mathbf{A}^H \mathbf{y}\) to the output of the forward pass (which effectively implements a distributed matrix-matrix multiplication \(\mathbf{X}_{adj} = \mathbf{A}^H \mathbf{Y}\)). Note that \(\mathbf{X}_{adj}\) is again distributed in the same way as the input \(\mathbf{X}\).
xadj = Aop.H @ y
To conclude, we verify our result against the equivalent serial version of the operation by gathering the resulting matrices on rank 0 and reorganizing the returned 1D arrays into 2D arrays.
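The reorganization step can be tried in isolation with the NumPy sketch below. It assumes that the gathered 1D array is the concatenation of row-major flattened column blocks, which is how `X_p.flatten()` fills the distributed array earlier in this example. The helper name `unflatten_column_blocks` and the uneven `col_counts` are hypothetical.

```python
import numpy as np


def unflatten_column_blocks(flat, n_rows, col_counts):
    """Rebuild an (n_rows, sum(col_counts)) matrix from concatenated,
    row-major flattened column blocks."""
    blocks, offset = [], 0
    for cnt in col_counts:
        size = n_rows * cnt
        # Each block was flattened on its own, so reshape it on its own
        blocks.append(flat[offset:offset + size].reshape(n_rows, cnt))
        offset += size
    return np.hstack(blocks)


Y_ref = np.arange(12.0).reshape(3, 4)
col_counts = [3, 1]  # hypothetical uneven split of the 4 columns

# Emulate what each rank contributes: its column block, flattened
pieces = np.split(Y_ref, np.cumsum(col_counts)[:-1], axis=1)
flat = np.concatenate([blk.flatten() for blk in pieces])

Y_rebuilt = unflatten_column_blocks(flat, Y_ref.shape[0], col_counts)
print(np.array_equal(Y_rebuilt, Y_ref))
```

Reshaping the whole 1D array at once would interleave the blocks incorrectly, because each block is flattened separately before being concatenated.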
# Local benchmarks
y = y.asarray(masked=True)
col_counts = [min(blk_cols, M - j * blk_cols) for j in range(p_prime)]
y_blocks = []
offset = 0
for cnt in col_counts:
    block_size = N * cnt
    y_block = y[offset: offset + block_size]
    if len(y_block) != 0:
        y_blocks.append(
            y_block.reshape(N, cnt)
        )
    offset += block_size
y = np.hstack(y_blocks)
xadj = xadj.asarray(masked=True)
xadj_blocks = []
offset = 0
for cnt in col_counts:
    block_size = K * cnt
    xadj_blk = xadj[offset: offset + block_size]
    if len(xadj_blk) != 0:
        xadj_blocks.append(
            xadj_blk.reshape(K, cnt)
        )
    offset += block_size
xadj = np.hstack(xadj_blocks)
if rank == 0:
    y_loc = (A @ X).squeeze()
    xadj_loc = (A.T.dot(y_loc.conj())).conj().squeeze()
    if not np.allclose(y, y_loc, rtol=1e-6):
        print("FORWARD VERIFICATION FAILED")
        print(f'distributed: {y}')
        print(f'expected: {y_loc}')
    else:
        print("FORWARD VERIFICATION PASSED")
    if not np.allclose(xadj, xadj_loc, rtol=1e-6):
        print("ADJOINT VERIFICATION FAILED")
        print(f'distributed: {xadj}')
        print(f'expected: {xadj_loc}')
    else:
        print("ADJOINT VERIFICATION PASSED")
FORWARD VERIFICATION PASSED
ADJOINT VERIFICATION PASSED