3.1.3.1. Inter-node Implementation

template<class T>
class chase::mpi::ChaseMpiDLA : public chase::mpi::ChaseMpiDLAInterface<T>

A derived class of ChaseMpiDLAInterface which mostly implements the MPI collective communication part of ChASE-MPI, targeting distributed-memory systems with or without GPUs.

The node-local computations are mostly implemented in ChaseMpiDLABlaslapack and ChaseMpiDLAMultiGPU. The class supports both the Block Distribution and the Block-Cyclic Distribution schemes.

Public Functions

void preApplication(T *V, std::size_t locked, std::size_t block) override

This function performs the pre-application steps for the distributed HEMM in ChASE. These steps may vary between implementations targeting different architectures; examples are backing up some buffers or copying data from CPU to GPU.

Parameters
  • V: a pointer to a matrix

  • locked: an integer indicating the number of locked (converged) eigenvectors

  • block: an integer indicating the number of non-locked (non-converged) eigenvectors

void apply(T alpha, T beta, std::size_t offset, std::size_t block, std::size_t locked) override

  • In ChaseMpiDLA, the HEMM operation uses MPI collective communication: an ALLREDUCE sums the products of the local matrices within either the column communicator or the row communicator.

  • The workflow is:

    • compute B_ = H * C_ (local computation)

    • Allreduce(B_, MPI_SUM) (communication within column communicator)

    • switch operation

    • compute C_ = H**H * B_ (local computation)

    • Allreduce(C_, MPI_SUM) (communication within row communicator)

    • switch operation

  • This function mainly implements the collective communications, while the local computation is implemented in ChaseMpiDLABlaslapack and ChaseMpiDLAMultiGPU, which target different architectures.

  • the computation of local GEMM invokes
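
The first half of the workflow above (a local product followed by a sum reduction) can be sketched without MPI by emulating the ranks of one communicator in a loop. This is an illustration only, not ChASE code: the helpers `localGemm`, `allreduceSum` and `distributedProduct`, the column-major layout and the panel-wise data split are all assumptions made for the example. A real implementation would call MPI_Allreduce with MPI_SUM instead of the serial loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: B = H * C for column-major H (m x k) and C (k x n).
std::vector<double> localGemm(const std::vector<double>& H,
                              const std::vector<double>& C,
                              std::size_t m, std::size_t k, std::size_t n) {
    assert(H.size() == m * k && C.size() == k * n);
    std::vector<double> B(m * n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t i = 0; i < m; ++i)
                B[i + j * m] += H[i + p * m] * C[p + j * k];
    return B;
}

// Hypothetical serial stand-in for MPI_Allreduce(..., MPI_SUM, comm):
// every emulated rank ends up with the element-wise sum of all partials.
void allreduceSum(std::vector<std::vector<double>>& partials) {
    if (partials.empty()) return;
    std::vector<double> sum(partials[0].size(), 0.0);
    for (const auto& p : partials) {
        assert(p.size() == sum.size());
        for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += p[i];
    }
    for (auto& p : partials) p = sum;
}

// Emulates "compute B_ = H * C_" followed by "Allreduce(B_, MPI_SUM)".
// Assumption: each rank owns one column panel of H (m x kLocal) and the
// matching row panel of C (kLocal x n).
std::vector<double> distributedProduct(
    const std::vector<std::vector<double>>& Hpanels,
    const std::vector<std::vector<double>>& Cpanels,
    std::size_t m, std::size_t kLocal, std::size_t n) {
    assert(Hpanels.size() == Cpanels.size() && !Hpanels.empty());
    std::vector<std::vector<double>> partials;
    for (std::size_t r = 0; r < Hpanels.size(); ++r)
        partials.push_back(localGemm(Hpanels[r], Cpanels[r], m, kLocal, n));
    allreduceSum(partials);
    return partials[0];  // identical on every emulated rank
}
```

Each emulated rank contributes only a partial product; the full result exists only after the reduction, which is why the communication step follows every local computation in the workflow.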

bool postApplication(T *V, std::size_t block, std::size_t locked) override

Copies the rectangular matrix from buffer v1 to buffer v2. In the distributed-memory implementation of ChASE, this operation copies from a matrix that is distributed within each column communicator and redundant among different column communicators to a matrix that is redundantly distributed across all MPI processes. In the next iteration of ChASE-MPI, this operation then takes place in the row communicator…

Parameters
  • V: the target buffer

  • block: number of columns to copy from v1 to v2

  • locked: number of converged eigenvectors.

void shiftMatrix(T c, bool isunshift = false) override

  • For ChaseMpiDLA, shiftMatrix is

    • implemented as a nested loop in ChaseMpiDLABlaslapack for pure-CPU distributed-memory ChASE

    • implemented on each GPU in ChaseMpiDLAMultiGPU for multi-GPU distributed-memory ChASE
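
For the pure-CPU case, a nested-loop diagonal shift of a locally stored block might look like the sketch below. The function name `shiftLocalBlock`, the offset parameters and the column-major layout are assumptions for illustration; this section does not describe how ChaseMpiDLABlaslapack stores or indexes its block, and the role of `isunshift` is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: add c to every entry of the local block that lies on
// the diagonal of the global matrix. The block is m x n, column-major, and
// its (0,0) entry sits at global position (rowOffset, colOffset).
template <class T>
void shiftLocalBlock(std::vector<T>& H, std::size_t m, std::size_t n,
                     std::size_t rowOffset, std::size_t colOffset, T c) {
    assert(H.size() == m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            // Only entries whose global row equals their global column
            // belong to the diagonal of the full matrix.
            if (rowOffset + i == colOffset + j) {
                H[i + j * m] += c;
            }
        }
    }
}
```

A block that does not intersect the global diagonal is left unchanged, so every process can run the same loop regardless of its position in the process grid.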

void applyVec(T *B, T *C) override

  • For ChaseMpiDLA, applyVec is implemented with the functions defined in this class.

  • applyVec is used by ChaseMpi::Lanczos(), which requires the input arguments B and C to be vectors of size N_ that are redundantly distributed across all MPI processes.

  • The sequence of calls is:

    • ChaseMpiDLA::preApplication(B, 0, 1)

    • ChaseMpiDLA::apply(One, Zero, 0, 1, 0)

    • ChaseMpiDLA::postApplication(C, 1, 0)
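
The three-call sequence above can be illustrated with a single-process toy class. `ToyDLA` is hypothetical and is not the ChASE class: it ignores all MPI communication, handles only one vector at a time, returns void from postApplication for brevity, and treats alpha and beta as GEMM-style scaling factors, which is an assumption not stated in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-process stand-in for the pre/apply/post pipeline.
class ToyDLA {
public:
    ToyDLA(const std::vector<double>& H, std::size_t N)
        : H_(H), N_(N), in_(N, 0.0), out_(N, 0.0) {
        assert(H_.size() == N_ * N_);  // H is N x N, column-major
    }

    // Stage the input vector in an internal buffer.
    void preApplication(const double* V, std::size_t locked, std::size_t block) {
        assert(block == 1);  // the toy handles a single vector only
        for (std::size_t i = 0; i < N_; ++i) in_[i] = V[locked * N_ + i];
    }

    // out_ = alpha * H * in_ + beta * out_ (assumed meaning of alpha, beta).
    void apply(double alpha, double beta, std::size_t offset,
               std::size_t block, std::size_t locked) {
        assert(offset == 0 && block == 1 && locked == 0);
        for (std::size_t i = 0; i < N_; ++i) {
            double sum = 0.0;
            for (std::size_t j = 0; j < N_; ++j) sum += H_[i + j * N_] * in_[j];
            out_[i] = alpha * sum + beta * out_[i];
        }
    }

    // Copy the result from the internal buffer to the target.
    void postApplication(double* V, std::size_t block, std::size_t locked) {
        assert(block == 1);
        for (std::size_t i = 0; i < N_; ++i) V[locked * N_ + i] = out_[i];
    }

    // Same call order and arguments as listed above: C = H * B.
    void applyVec(const double* B, double* C) {
        preApplication(B, 0, 1);
        apply(1.0, 0.0, 0, 1, 0);
        postApplication(C, 1, 0);
    }

private:
    std::vector<double> H_;
    std::size_t N_;
    std::vector<double> in_;
    std::vector<double> out_;
};
```

With One and Zero as the scaling factors, the composed call reduces to a plain matrix-vector product, which is what a Lanczos step needs.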

void axpy(std::size_t N, T *alpha, T *x, std::size_t incx, T *y, std::size_t incy) override

A BLAS-like function which performs a constant times a vector plus a vector.

Parameters
  • [in] N: number of elements in input vector(s).

  • [in] alpha: the scalar that multiplies x in the AXPY operation.

  • [in] x: an array of type T, dimension ( 1 + ( N - 1 )*abs( incx ) ).

  • [in] incx: storage spacing between elements of x.

  • [in/out] y: an array of type T, dimension ( 1 + ( N - 1 )*abs( incy ) ).

  • [in] incy: storage spacing between elements of y.
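
As a reading aid for the parameters above, a strided AXPY can be sketched as follows. `axpyRef` is a hypothetical reference loop, not the ChASE implementation, which presumably forwards to a BLAS routine; alpha is passed by pointer only to mirror the documented signature.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference loop for strided AXPY:
// y[i * incy] = alpha * x[i * incx] + y[i * incy] for i in [0, N).
template <class T>
void axpyRef(std::size_t N, const T* alpha, const T* x, std::size_t incx,
             T* y, std::size_t incy) {
    assert(alpha != nullptr);
    for (std::size_t i = 0; i < N; ++i) {
        y[i * incy] += (*alpha) * x[i * incx];
    }
}
```

With incx = 2, only every second element of x is read, which is why x must hold at least 1 + (N - 1) * incx elements.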

void scal(std::size_t N, T *a, T *x, std::size_t incx) override

A BLAS-like function which scales a vector by a constant.

Base<T> nrm2(std::size_t n, T *x, std::size_t incx) override

A BLAS-like function which returns the Euclidean norm of a vector.

Return

the Euclidean norm of vector x.

Parameters
  • [in] n: number of elements in the input vector.

  • [in] x: an array of type T, dimension ( 1 + ( n - 1 )*abs( incx ) ).

  • [in] incx: storage spacing between elements of x.
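
The Euclidean norm of a strided vector can be sketched as below. `nrm2Ref` is a hypothetical reference loop, not the ChASE implementation. It returns double for simplicity, whereas the documented function returns Base<T>, and it sums squares naively, so unlike a production BLAS routine it does not guard against overflow or underflow.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>

// Hypothetical reference loop: sqrt(sum_i |x[i * incx]|^2).
// std::abs covers both real and std::complex element types.
template <class T>
double nrm2Ref(std::size_t n, const T* x, std::size_t incx) {
    assert(n == 0 || x != nullptr);
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double magnitude = std::abs(x[i * incx]);
        sumSquares += magnitude * magnitude;
    }
    return std::sqrt(sumSquares);
}
```

For complex T the magnitude of each element is used, so the result is always a real, non-negative number.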

T dot(std::size_t n, T *x, std::size_t incx, T *y, std::size_t incy) override

A BLAS-like function which forms the dot product of two vectors.

Return

the dot product of vectors x and y.

Parameters
  • [in] n: number of elements in the input vectors.

  • [in] x: an array of type T, dimension ( 1 + ( n - 1 )*abs( incx ) ).

  • [in] incx: storage spacing between elements of x.

  • [in] y: an array of type T, dimension ( 1 + ( n - 1 )*abs( incy ) ).

  • [in] incy: storage spacing between elements of y.
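
A strided dot product can be sketched as follows. `dotRef` is a hypothetical reference loop, not the ChASE implementation. It applies no conjugation: this section does not state whether the complex variant conjugates x, so for complex T the sketch should be read as the unconjugated form only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference loop: sum_i x[i * incx] * y[i * incy].
// No conjugation is applied (an assumption; see the note above).
template <class T>
T dotRef(std::size_t n, const T* x, std::size_t incx,
         const T* y, std::size_t incy) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    T sum = T(0);
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i * incx] * y[i * incy];
    }
    return sum;
}
```

The two vectors may use different strides, which allows, for example, a contiguous vector to be combined with one row of a column-major matrix.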