3.1.3.1. Inter-node Implementation
- template<class T>
  class chase::mpi::ChaseMpiDLA : public chase::mpi::ChaseMpiDLAInterface<T>

  A derived class of ChaseMpiDLAInterface which implements mostly the MPI collective-communication part of ChASE-MPI, targeting distributed-memory systems with or without GPUs. The computation within a node is mostly implemented in ChaseMpiDLABlaslapack and ChaseMpiDLAMultiGPU. It supports both the Block Distribution and the Block-Cyclic Distribution schemes.

  Public Functions
- void preApplication(T *V, std::size_t locked, std::size_t block) override

  This function performs some pre-application steps for the distributed HEMM in ChASE. These steps may vary between implementations targeting different architectures; they can include backing up some buffers, copying data from CPU to GPU, etc. (see the sketch after this entry).

  Parameters
    - V: a pointer to a matrix
    - locked: an integer indicating the number of locked (converged) eigenvectors
    - block: an integer indicating the number of non-locked (non-converged) eigenvectors
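A minimal, purely illustrative sketch of one such pre-application step, staging the non-locked columns of V into an internal buffer; the buffer name and layout are assumptions, not part of the class:

    #include <cstddef>
    #include <cstring>

    // Hypothetical staging step: copy the `block` non-locked columns of V
    // (each of length N, column-major) into an internal buffer before the
    // distributed HEMM. Real implementations may instead copy to the GPU.
    template <class T>
    void pre_application_sketch(T* staging, const T* V, std::size_t N,
                                std::size_t locked, std::size_t block) {
      std::memcpy(staging, V + locked * N, N * block * sizeof(T));
    }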
- void apply(T alpha, T beta, std::size_t offset, std::size_t block, std::size_t locked) override

  In ChaseMpiDLA, this function performs the collective communication of the HEMM operation based on MPI, which ALLREDUCEs the product of the local matrices either within the column communicator or within the row communicator.

  The workflow is:
    - compute B_ = H * C_ (local computation)
    - Allreduce(B_, MPI_SUM) (communication within the column communicator)
    - switch operation
    - compute C_ = H**H * B_ (local computation)
    - Allreduce(C_, MPI_SUM) (communication within the row communicator)
    - switch operation
    - …

  This function implements mainly the collective communications, while the local computation is implemented in ChaseMpiDLABlaslapack and ChaseMpiDLAMultiGPU, targeting different architectures. The local GEMM invokes
    - BLAS GEMM for pure-CPU distributed-memory ChASE, implemented in ChaseMpiDLABlaslapack::apply()
    - cuBLAS GEMM for multi-GPU distributed-memory ChASE, implemented in ChaseMpiDLAMultiGPU::apply()
  A sketch of the communication pattern is given after this entry.
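A minimal sketch of the pattern above for the real double case, assuming a column-major local block H_loc of size m x n, local buffers C_loc/B_loc, and 2D-grid communicators col_comm/row_comm (none of these names are the actual class members; alpha and beta are fixed to 1 and 0 for brevity):

    #include <mpi.h>
    #include <cblas.h>
    #include <cstddef>

    // One round trip of the apply() workflow: local GEMM, then Allreduce
    // within the column communicator; switched GEMM, then Allreduce within
    // the row communicator. For real scalars, H**H reduces to H**T.
    void apply_sketch(const double* H_loc, double* C_loc, double* B_loc,
                      std::size_t m, std::size_t n, std::size_t block,
                      MPI_Comm col_comm, MPI_Comm row_comm) {
      // 1) local computation: B_ = H * C_
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                  (int)m, (int)block, (int)n,
                  1.0, H_loc, (int)m, C_loc, (int)n, 0.0, B_loc, (int)m);
      // 2) sum the partial products within the column communicator
      MPI_Allreduce(MPI_IN_PLACE, B_loc, (int)(m * block), MPI_DOUBLE,
                    MPI_SUM, col_comm);
      // 3) switched operation: C_ = H**T * B_
      cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                  (int)n, (int)block, (int)m,
                  1.0, H_loc, (int)m, B_loc, (int)m, 0.0, C_loc, (int)n);
      // 4) sum the partial products within the row communicator
      MPI_Allreduce(MPI_IN_PLACE, C_loc, (int)(n * block), MPI_DOUBLE,
                    MPI_SUM, row_comm);
    }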
- bool postApplication(T *V, std::size_t block, std::size_t locked) override

  Copy the rectangular matrix from buffer v1 to v2. In distributed-memory ChASE, this operation performs a copy from a matrix that is distributed within each column communicator and redundant among different column communicators to a matrix that is redundantly distributed across all MPI procs. In the next iteration of ChASE-MPI, this operation then takes place in the row communicator… (see the sketch after this entry).

  Parameters
    - V: the target buffer
    - block: number of columns to copy from v1 to v2
    - locked: number of converged eigenvectors.
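A hedged sketch of such a redistribution for the real double case, gathering the row pieces of each column within the column communicator so that every rank ends up with a full, redundant copy; v1_loc, m_loc, recvcounts, displs and col_comm are assumed names, not the actual class members:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // v1_loc: m_loc x block local piece (column-major); V: N x block target
    // that becomes redundant on every rank of col_comm after the gather.
    void post_application_sketch(const double* v1_loc, double* V,
                                 std::size_t m_loc, std::size_t N,
                                 std::size_t block,
                                 const std::vector<int>& recvcounts,
                                 const std::vector<int>& displs,
                                 MPI_Comm col_comm) {
      // one Allgatherv per column, for clarity; real code may use derived
      // datatypes or a single packed gather instead
      for (std::size_t j = 0; j < block; ++j) {
        MPI_Allgatherv(v1_loc + j * m_loc, (int)m_loc, MPI_DOUBLE,
                       V + j * N, recvcounts.data(), displs.data(), MPI_DOUBLE,
                       col_comm);
      }
    }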
- void shiftMatrix(T c, bool isunshift = false) override

  For ChaseMpiDLA, shiftMatrix is
    - implemented in a nested loop for pure-CPU distributed-memory ChASE, in ChaseMpiDLABlaslapack
    - implemented on each GPU for multi-GPU distributed-memory ChASE, in ChaseMpiDLAMultiGPU
  A sketch of the nested-loop variant is given after this entry.
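A minimal sketch of the nested-loop (CPU) variant, assuming the local block H_loc starts at global offsets (off_row, off_col) and has dimensions m_loc x n_loc; these names are illustrative, not the actual members:

    #include <cstddef>

    // Shift the diagonal of H by c: only local entries whose global row
    // index equals their global column index are touched (column-major block).
    void shift_matrix_sketch(double* H_loc, std::size_t m_loc, std::size_t n_loc,
                             std::size_t off_row, std::size_t off_col, double c) {
      for (std::size_t j = 0; j < n_loc; ++j) {
        for (std::size_t i = 0; i < m_loc; ++i) {
          if (off_row + i == off_col + j) {   // global diagonal entry
            H_loc[i + j * m_loc] += c;
          }
        }
      }
    }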
- void applyVec(T *B, T *C) override

  For ChaseMpiDLA, applyVec is implemented with the functions defined in this class. applyVec is used by ChaseMpi::Lanczos(), which requires the input arguments B and C to be vectors of size N_ that are redundantly distributed across all MPI procs. Here are the details (see also the sketch after this entry):

    ChaseMpiDLA::preApplication(B, 0, 1)
    ChaseMpiDLA::apply(One, Zero, 0, 1, 0)
    ChaseMpiDLA::postApplication(C, 1, 0)
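Written out as code, the documented sequence could look roughly as follows; the free-function wrapper and the T(1.0)/T(0.0) constants standing for One/Zero are assumptions for illustration:

    // Hypothetical helper reproducing the documented call sequence
    // (DLA stands for ChaseMpiDLA<T>).
    template <class T, class DLA>
    void apply_vec_sketch(DLA& dla, T* B, T* C) {
      dla.preApplication(B, /*locked=*/0, /*block=*/1); // stage the input vector
      dla.apply(T(1.0), T(0.0), 0, 1, 0);               // one distributed HEMM step
      dla.postApplication(C, 1, 0);                     // make the result redundant
    }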
- void axpy(std::size_t N, T *alpha, T *x, std::size_t incx, T *y, std::size_t incy) override

  A BLAS-like function which performs a constant times a vector plus a vector (see the reference sketch after this entry).

  Parameters
    - [in] N: number of elements in the input vector(s).
    - [in] alpha: the scalar multiplying x in the AXPY operation.
    - [in] x: an array of type T, dimension ( 1 + ( N - 1 )*abs( incx ) ).
    - [in] incx: storage spacing between elements of x.
    - [in/out] y: an array of type T, dimension ( 1 + ( N - 1 )*abs( incy ) ).
    - [in] incy: storage spacing between elements of y.
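A hedged reference sketch of the AXPY semantics described above (the actual member's computation lives in the node-level classes):

    #include <cstddef>

    // y := alpha*x + y with strides incx/incy, as in BLAS ?axpy.
    template <class T>
    void axpy_sketch(std::size_t N, const T* alpha, const T* x,
                     std::size_t incx, T* y, std::size_t incy) {
      for (std::size_t i = 0; i < N; ++i) {
        y[i * incy] += (*alpha) * x[i * incx];
      }
    }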
- void scal(std::size_t N, T *a, T *x, std::size_t incx) override

  For ChaseMpiDLA, scal is implemented by calling the corresponding routine in ChaseMpiDLABlaslapack or ChaseMpiDLAMultiGPU. The implementation is the same with or without GPUs. Parallelism is supported within a node if multi-threading is activated. For the meaning of this function, please visit ChaseMpiDLAInterface. A reference sketch of the operation is given after this entry.
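A hedged reference sketch of the SCAL semantics (x := a*x), which the node-level classes provide:

    #include <cstddef>

    // x := a*x with stride incx, as in BLAS ?scal (reference semantics only).
    template <class T>
    void scal_sketch(std::size_t N, const T* a, T* x, std::size_t incx) {
      for (std::size_t i = 0; i < N; ++i) {
        x[i * incx] *= *a;
      }
    }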
- Base<T> nrm2(std::size_t n, T *x, std::size_t incx) override

  A BLAS-like function which returns the Euclidean norm of a vector (see the reference sketch after this entry).

  Return
    the Euclidean norm of vector x.

  Parameters
    - [in] n: number of elements in the input vector.
    - [in] x: an array of type T, dimension ( 1 + ( n - 1 )*abs( incx ) ).
    - [in] incx: storage spacing between elements of x.
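A hedged reference sketch of the NRM2 semantics; it returns double instead of Base<T> to stay self-contained:

    #include <cmath>
    #include <complex>
    #include <cstddef>

    // Euclidean norm of x with stride incx, as in BLAS ?nrm2
    // (reference semantics only; no over/underflow safeguards).
    template <class T>
    double nrm2_sketch(std::size_t n, const T* x, std::size_t incx) {
      double sum = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
        const double a = std::abs(x[i * incx]); // |x_i|, also valid for complex T
        sum += a * a;
      }
      return std::sqrt(sum);
    }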
- T dot(std::size_t n, T *x, std::size_t incx, T *y, std::size_t incy) override

  A BLAS-like function which forms the dot product of two vectors (see the reference sketch after this entry).

  Return
    the dot product of vectors x and y.

  Parameters
    - [in] n: number of elements in the input vectors.
    - [in] x: an array of type T, dimension ( 1 + ( n - 1 )*abs( incx ) ).
    - [in] incx: storage spacing between elements of x.
    - [in] y: an array of type T, dimension ( 1 + ( n - 1 )*abs( incy ) ).
    - [in] incy: storage spacing between elements of y.
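A hedged reference sketch of the DOT semantics; whether the complex version conjugates x (as in ?dotc) follows the underlying node-level BLAS call, so the conjugation below is an assumption:

    #include <complex>
    #include <cstddef>

    // Identity for real scalars, complex conjugate for std::complex ones.
    template <class R> R conj_scalar(R v) { return v; }
    template <class R> std::complex<R> conj_scalar(std::complex<R> v) { return std::conj(v); }

    // Dot product of x and y with strides incx/incy (reference semantics only).
    template <class T>
    T dot_sketch(std::size_t n, const T* x, std::size_t incx,
                 const T* y, std::size_t incy) {
      T result = T(0);
      for (std::size_t i = 0; i < n; ++i) {
        result += conj_scalar(x[i * incx]) * y[i * incy];
      }
      return result;
    }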
-
void