5. Modules

5.1. Overview

ChASE is a stand-alone, high-performance parallel library based on our original design of the Chebyshev accelerated subspace iteration algorithm. The library offers portability to heterogeneous architectures and easy integration into existing codes. This goal is achieved by separating the implementation of the ChASE algorithm from the required numerical kernels via an interface of pure C++ abstract classes. Classes derived from this interface handle the data distribution and the (parallel) execution of each kernel. The required numerical kernels are Basic Linear Algebra Subprograms (BLAS)-3 compatible operations, such as (parallel) matrix-matrix multiplication and QR factorization. This "stand-alone" strategy grants ChASE a high degree of flexibility that makes integrating the library into most application codes quite simple, while efficiently using the available machine resources.

The UML class diagram below uses the implementation of the Chebyshev filter, whose kernel is a series of Hermitian Matrix-Matrix Products (HEMM), as an example to illustrate how ChASE is implemented and how it is ported to different architectures. This section gives the user an insight into how to set up their eigenproblems and solve them with ChASE on different computing architectures.

[Figure: UML class diagram of ChASE (_images/ChASE_UML.jpg)]

As shown in the diagram above, the entire implementation of ChASE lives within the C++ namespace chase.

5.2. Basic Classes

5.2.1. chase::Chase

The numerical kernels required by the ChASE algorithm are declared in the class chase::Chase. All of them are virtual functions, and concrete implementations targeting different computing architectures must be provided. The class covers the following functionalities (a simplified interface sketch is given after the list):

  • HEMM: Hermitian Matrix-Matrix Multiplication

  • QR: QR factorization

  • RR: Rayleigh-Ritz projection and small problem solver

  • Resd: compute the eigenpair residuals

  • Lanczos: estimate the bounds of the user-targeted part of the spectrum with a Lanczos eigensolver

  • LanczosDos: estimate the spectral distribution of eigenvalues

  • Swap: swap the two matrices of vectors used in the Chebyshev filter

  • Locking: lock the converged eigenpairs

  • Shift: shift the diagonal of the matrix A used in the three-term recurrence relation implemented in the Chebyshev filter

  • etc.
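
The sketch below illustrates, in simplified form, what such an abstract kernel interface looks like. It is only an illustration: the member names follow the list above, but the signatures are placeholders and do not reproduce the actual chase::Chase declarations.

    // Simplified sketch of an abstract kernel interface in the spirit of
    // chase::Chase; signatures are illustrative placeholders, not the real API.
    #include <cstddef>

    namespace sketch {

    template <class T>
    class ChaseKernels {
     public:
      virtual ~ChaseKernels() = default;

      virtual void HEMM(std::size_t block, T alpha, T beta) = 0;   // W <- alpha*H*V + beta*W
      virtual void QR(std::size_t locked) = 0;                     // orthonormalize the filtered vectors
      virtual void RR(double* ritzv, std::size_t block) = 0;       // Rayleigh-Ritz projection + small solver
      virtual void Resd(double* ritzv, double* resd,
                        std::size_t locked) = 0;                   // eigenpair residuals
      virtual void Lanczos(std::size_t m, double* upperb) = 0;     // bounds of the targeted spectrum
      virtual void LanczosDos(std::size_t idx, std::size_t m,
                              T* ritzVc) = 0;                      // spectral density estimate
      virtual void Swap(std::size_t i, std::size_t j) = 0;         // swap columns between the two blocks
      virtual void Lock(std::size_t new_converged) = 0;            // lock converged eigenpairs
      virtual void Shift(T c, bool unshift = false) = 0;           // shift the diagonal of A
    };

    }  // namespace sketch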

Note

For more details on the virtual kernels, please refer to Virtual Abstract of Numerical Kernels. Different parallel implementations of these virtual kernels can also be found in Parallel implementations.

5.2.2. chase::Algorithm

The class chase::Algorithm is aware of the class chase::Chase, and it defines the algorithmic implementation of ChASE in terms of the virtual kernels declared in chase::Chase. It includes the following functionalities:

  • Chebyshev filter

  • calculation of the filter degree

  • Lanczos procedure to estimate the bounds of the spectrum

  • locking of the converged Ritz pairs

  • etc.

The function chase::Solve provides the implementation of the ChASE algorithm by assembling the algorithmic building blocks and numerical kernels defined in chase::Algorithm and chase::Chase; a rough sketch of the resulting iteration is given below.
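
To give a rough idea of how chase::Solve ties these pieces together, the following pseudocode sketch shows the subspace iteration loop written against the simplified kernel interface sketched earlier. It is illustrative only (degree optimization, the three-term Chebyshev recurrence, and the convergence test are reduced to comments) and does not reproduce the library's actual implementation.

    #include <cstddef>
    #include <vector>

    // Illustrative control flow only; `Kernels` stands for any implementation of
    // the (sketched) kernel interface.
    template <class Kernels>
    void solve_sketch(Kernels& k, std::size_t nev, std::size_t block, std::size_t max_iter) {
      std::vector<double> ritzv(block), resd(block);

      double upperb = 0.0;
      k.Lanczos(25, &upperb);           // estimate the upper bound of the spectrum

      std::size_t locked = 0;           // number of converged (locked) eigenpairs
      for (std::size_t it = 0; it < max_iter && locked < nev; ++it) {
        // Chebyshev filter: a sequence of shifted HEMMs (three-term recurrence)
        // applied to the non-locked columns, e.g. k.Shift(...); k.HEMM(...); ...

        k.QR(locked);                                 // re-orthonormalize the filtered block
        k.RR(ritzv.data(), block - locked);           // Rayleigh-Ritz projection, reduced problem
        k.Resd(ritzv.data(), resd.data(), locked);    // residuals of the current Ritz pairs

        // count newly converged pairs (from `resd` and the tolerance), move them
        // to the front of the block (Swap) and lock them
        std::size_t new_converged = 0;                // placeholder for the convergence test
        k.Lock(new_converged);
        locked += new_converged;

        // recompute the optimal filter degree for the remaining vectors
      }
    }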

Note

This class expresses the ChASE algorithm through the virtual functions; it cannot run in practice until concrete implementations of these virtual functions are provided.

Note

The details of this class are only provided in the developer documentation; please refer to General algorithm.

5.2.3. chase::ChaseConfig

The class chase::Algorithm is aware of the class chase::ChaseConfig, which defines the functions to set different parameters of ChASE.

Besides setting up standard parameters, such as the size of the matrix defining the eigenproblem and the number of wanted eigenvalues, the public functions of this class initialize all internal parameters and allow the experienced user to set the values of the parameters of core functionalities (e.g., the Lanczos DoS). The aim is to influence the behavior of the library in special cases where the default parameter values yield sub-optimal efficiency in terms of performance and/or accuracy.
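
As an example, the fragment below sketches how an application might tune a few of these parameters. It assumes an already constructed solver object (here called solver, e.g. an instance of chase::mpi::ChaseMpi) that exposes its configuration through GetConfig(), as in the ChASE examples; the exact setter names and default values should be checked against the Configuration reference.

    // Hedged fragment: `solver` is an already constructed ChASE solver object.
    auto& config = solver.GetConfig();

    config.SetTol(1e-10);   // residual tolerance used to declare an eigenpair converged
    config.SetDeg(20);      // initial degree of the Chebyshev filter
    config.SetOpt(true);    // let ChASE optimize the filter degree internally
    config.SetMaxIter(25);  // upper limit on the number of subspace iterations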

Note

For more details of all available functions, please refer to Configuration.

5.2.4. chase::ChasePerfData

This class defines the performance data for the different algorithmic stages and numerical kernels of ChASE, e.g., the floating point operations performed by ChASE for a given matrix size and a required number of eigenpairs to be computed.

The chase::ChasePerfData class collects and handles information relative to the execution of the eigensolver. It collects information about

  • Number of subspace iterations

  • Number of filtered vectors

  • Timings of each main algorithmic procedure (Lanczos, Filter, etc.)

  • Number of FLOPs executed

The number of iterations and filtered vectors can be used to monitor the behavior of the algorithm as it attempts to converge all the desired eigenpairs. The timings and the number of FLOPs are used to measure performance, especially parallel performance. The timings are stored in a vector of objects derived from the class template std::chrono::duration.

Note

For more details of all available functions, please refer to Performance.

5.2.5. chase::PerformanceDecoratorChase

This is a class derived from chase::Chase, which plays the role of an interface for the kernels used by the library. All members of the chase::Chase class are virtual functions. These functions are re-implemented in the chase::PerformanceDecoratorChase class. All derived members that provide an interface to computational kernels are re-implemented by decorating the original function with timers, which are members of the chase::ChasePerfData class. All derived members that provide an interface to input or output data are called without any specific decoration. In addition to the virtual members of the chase::Chase class, the chase::PerformanceDecoratorChase class also has among its public members a reference to an object of type chase::ChasePerfData. When ChASE is used to solve an eigenvalue problem, the members of PerformanceDecoratorChase are called instead of the virtual member functions of the chase::Chase class. In this way, all counters and timers are automatically updated and collected in the correct order.
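
A typical usage pattern, following the ChASE examples, is to wrap the solver object in the decorator, pass the decorator to chase::Solve, and query the collected performance data afterwards. The fragment below is a hedged sketch; the class and function names should be checked against the Performance reference.

    // Hedged fragment: `solver` is an already constructed chase::mpi::ChaseMpi
    // object and T is its scalar type.
    chase::PerformanceDecoratorChase<T> performance(&solver);

    chase::Solve(&performance);          // run ChASE through the decorator

    chase::ChasePerfData& perf = performance.GetPerfData();
    perf.print();                        // report iterations, filtered vectors, timings, FLOPs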

Note

For more details of all available functions, please refer to Performance.

5.3. Override of Virtual Functions

The concrete implementations of the numerical kernels used by ChASE live in the namespace chase::mpi, which is nested inside the namespace chase. This namespace provides the parallel implementation of ChASE based on MPI (and CUDA) by overriding the virtual numerical kernels of the abstract interface, targeting both homogeneous and heterogeneous (multi-GPU) architectures.

5.3.1. chase::mpi::ChaseMpiMatrices

The class chase::mpi::ChaseMpiMatrices handles the allocation of the buffers for the matrices and vectors of the ChASE library, in both non-MPI and MPI mode.

Note

For more details of all available functions, please refer to ChaseMpiMatrices.

5.3.2. chase::mpi::ChaseMpiProperties

The class chase::mpi::ChaseMpiProperties sets up the MPI environment and the data distribution scheme (either Block or Block-Cyclic distribution) for ChASE. This class is aware of the class chase::mpi::ChaseMpiMatrices: it allocates the required buffers according to the configured MPI environment and data distribution by using the different constructors of chase::mpi::ChaseMpiMatrices. A construction sketch is given below.
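
For instance, a Block-distributed setup over MPI_COMM_WORLD can be created roughly as in the sketch below. The constructor arguments shown (matrix size N, number of wanted eigenpairs nev, extra search-space size nex, MPI communicator) follow the pattern of the ChASE examples and are assumptions to be checked against the ChaseMpiProperties reference; the Block-Cyclic variant takes additional blocking and process-grid parameters.

    // Hedged sketch: inside an MPI program, after MPI_Init().
    #include <complex>
    #include <cstddef>
    #include <mpi.h>
    // Header path as in the ChASE repository; adjust to your installation.
    #include "ChASE-MPI/chase_mpi_properties.hpp"

    using T = std::complex<double>;

    void setup_example() {
      std::size_t N   = 10000;   // dimension of the Hermitian matrix
      std::size_t nev = 100;     // number of wanted eigenpairs
      std::size_t nex = 40;      // extra vectors enlarging the search space

      // Block distribution over all ranks of MPI_COMM_WORLD; the Block-Cyclic
      // constructor additionally takes block sizes and the 2D process grid.
      auto* props = new chase::mpi::ChaseMpiProperties<T>(N, nev, nex, MPI_COMM_WORLD);
      (void)props;  // passed later to the ChaseMpi constructor (see below)
    }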

Note

For more details of all available functions, please refer to ChaseMpiProperties.

5.3.3. chase::mpi::ChaseMpi

chase::mpi::ChaseMpi is a class derived from chase::Chase. It implements the virtual functions of the chase::Chase class, which define the essential numerical kernels of the ChASE algorithm. It is a class template with two template parameters: an implementation of chase::mpi::ChaseMpiDLAInterface and the scalar type used by the application. The numerical kernels defined in chase::Chase are further decoupled into Dense Linear Algebra operations (DLAs). Objects of chase::mpi::ChaseMpi targeting different computing platforms can be created by selecting different derived classes of chase::mpi::ChaseMpiDLAInterface; an instantiation sketch is given after the list below.

To be more precise, it is derived from the chase::Chase class, which plays the role of an interface for the kernels used by the library:

  • All members of the chase::Chase class are virtual functions. These functions are re-implemented in the chase::mpi::ChaseMpi class.

  • All the member functions of chase::mpi::ChaseMpi, which implement the virtual functions of the class chase::Chase, are built on the DLA routines provided by the class chase::mpi::ChaseMpiDLAInterface.

  • The DLA functions in chase::mpi::ChaseMpiDLAInterface are themselves virtual functions with different implementations targeting different computing architectures (sequential/parallel, CPU/GPU, shared-memory/distributed-memory, etc.). Inside chase::mpi::ChaseMpi, a call to a DLA function therefore dispatches to the implementation provided by the selected derived class. This is how the ChaseMpi class obtains a customized implementation for each architecture.

  • The class chase::mpi::ChaseMpi is aware of the classes chase::mpi::ChaseMpiMatrices and chase::mpi::ChaseMpiProperties.

    • For the shared-memory implementation, the constructor of chase::mpi::ChaseMpi takes an instance of chase::mpi::ChaseMpiMatrices as input.

    • For the distributed-memory implementation of chase::mpi::ChaseMpi, the setup of the MPI environment and communication scheme, as well as the distribution of the data (matrix, vectors) across MPI ranks, follow the chase::mpi::ChaseMpiProperties class; the matrix can be distributed in either the Block or the Block-Cyclic scheme. The required buffers are allocated during the construction of a chase::mpi::ChaseMpiProperties object.
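
Putting the pieces together, a distributed-memory instantiation looks roughly like the sketch below: the first template argument selects the DLA backend (here the CPU-only chase::mpi::ChaseMpiDLABlaslapack), the second the scalar type, and the constructor receives the ChaseMpiProperties object together with the application's buffers for the Ritz vectors and Ritz values. The constructor arguments and header paths are assumptions based on the ChASE examples (some versions also pass the local matrix buffer directly); verify them against the ChaseMpi reference.

    // Hedged sketch of a distributed-memory ChaseMpi instantiation (CPU backend).
    #include <complex>
    #include <cstddef>
    #include <vector>
    // Header paths as in the ChASE repository; adjust to your installation.
    #include "ChASE-MPI/chase_mpi.hpp"
    #include "ChASE-MPI/impl/chase_mpidla_blaslapack.hpp"

    using T     = std::complex<double>;
    using CHASE = chase::mpi::ChaseMpi<chase::mpi::ChaseMpiDLABlaslapack, T>;

    void build_solver(chase::mpi::ChaseMpiProperties<T>* props,
                      std::size_t m,    // local row count, obtained from `props`
                      std::size_t nev, std::size_t nex) {
      std::vector<T> V(m * (nev + nex));      // local part of the Ritz-vector block
      std::vector<double> Lambda(nev + nex);  // Ritz values

      CHASE solver(props, V.data(), Lambda.data());  // buffers distributed according to `props`

      // ... fill the local block of the matrix, tune solver.GetConfig(), then run
      // chase::Solve (typically through chase::PerformanceDecoratorChase).
    }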

Note

For more details of all available functions, please refer to ChaseMpi.

5.3.4. DLAs for shared-memory architectures

The DLAs for shared-memory architectures with and without GPU acceleration are implemented in the classes chase::mpi::ChaseMpiDLACudaSeq and chase::mpi::ChaseMpiDLABlaslapackSeq, respectively.

5.3.5. DLAs for distributed-memory architectures

The DLAs for distributed-memory architectures are further decoupled into two layers:

  • the first layer is for the collective communication between different computing nodes

  • the second layer is for the implementation of local computation within each node

    • for homogeneous, CPU-only systems, the local computation takes place on each individual MPI process, potentially parallelized with multi-threading, e.g., OpenMP.

    • for heterogeneous systems with GPUs, some local computation takes place on each individual MPI process, while the more intensive computations are offloaded to the GPU(s) bound to the respective MPI process.

The local computations without and with GPUs are implemented in the classes chase::mpi::ChaseMpiDLABlaslapack and chase::mpi::ChaseMpiDLAMultiGPU, respectively.

The collective communication layer is shared between the distributed-memory ChASE with and without GPU support; it is implemented in the class chase::mpi::ChaseMpiDLA. This class takes as input an instance of chase::mpi::ChaseMpiDLAInterface, either chase::mpi::ChaseMpiDLABlaslapack or chase::mpi::ChaseMpiDLAMultiGPU. In this way, it is able to access the different implementations of the local computation kernels. Switching between them is sketched below.
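
In practice, switching the distributed-memory build between CPU-only and multi-GPU execution then amounts to changing the DLA template argument of chase::mpi::ChaseMpi, roughly as sketched below (class names as given above, everything else unchanged).

    // Hedged sketch: selecting the node-local DLA backend for distributed memory.

    // CPU-only local computations (BLAS/LAPACK):
    using ChaseCpu = chase::mpi::ChaseMpi<chase::mpi::ChaseMpiDLABlaslapack,
                                          std::complex<double>>;

    // Local computations offloaded to the GPU(s) bound to each MPI process:
    using ChaseGpu = chase::mpi::ChaseMpi<chase::mpi::ChaseMpiDLAMultiGPU,
                                          std::complex<double>>;

    // In both cases an instance of chase::mpi::ChaseMpiDLA is created internally
    // to handle the collective communication between the MPI processes.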

Note

When an instance of chase::mpi::ChaseMpi is constructed for distributed-memory systems, its DLA template parameter should be either chase::mpi::ChaseMpiDLABlaslapack or chase::mpi::ChaseMpiDLAMultiGPU. An instance of the class chase::mpi::ChaseMpiDLA is then created internally with the selected implementation of the local computation kernels. In this way, ChASE can be ported to different computing architectures.