*PI: Juerg Hutter (University of Zurich)*

*July 1, 2017 - June 30, 2020*

## Project Summary

Advanced algorithms for large-scale electronic structure calculations are mostly based on processing multi-dimensional sparse data. Examples are sparse matrix-matrix multiplications in linear-scaling Kohn-Sham calculations and the efficient determination of the exact exchange energy. Going beyond mean-field approaches, e.g. to Møller-Plesset perturbation theory, RPA, coupled cluster, or GW methods, makes it necessary to manipulate higher-order sparse tensors. Very similar problems are encountered in other domains, such as signal processing, data mining, computer vision, and machine learning. This project is concerned with the development of a sparse tensor library that is easy to use for domain scientists and at the same time provides optimal performance on current and emerging hardware. The performance targets are both single-node performance (including GPUs and Intel MIC processors) and massive parallelism.

The starting point of the project is the realization that most tensor operations can be mapped onto matrix multiplications. We can therefore base the development on the existing domain library DBCSR, a distributed block-sparse matrix multiplication library. The goals of the project will be achieved by concentrating on three independent parts. First, a high-level interface will be developed that handles multi-dimensional tensors and arbitrary reduction operations between them by internally mapping the tensors to matrices and the reductions to matrix multiplications. Second, the 2.5D matrix multiplication algorithm in DBCSR will be extended to also work optimally for extremely skinny matrices. Such cases will be encountered frequently in high-rank tensor reductions but are not yet considered in DBCSR. Third, the single-node performance, which is dominated by a stream of small dense matrix multiplications, will be optimized by further developing a runtime system based on just-in-time compilation, automated kernel optimization, and machine learning.
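The mapping of a tensor reduction (contraction) onto a matrix multiplication can be illustrated with a minimal sketch. It is not the DBCSR interface: all function names here are hypothetical, and for brevity the tensors are stored element-wise as dictionaries rather than as distributed small dense blocks. The example contracts a rank-3 tensor with a rank-2 tensor, C[i,j,l] = sum_k A[i,j,k] B[k,l], by flattening the index pair (i,j) into a single row index, multiplying the resulting sparse matrices, and unflattening the row index of the product.

```python
from collections import defaultdict


def tensor_to_matrix(tensor, shape, row_dims, col_dims):
    """Flatten a sparse tensor {index_tuple: value} into a sparse matrix
    {(row, col): value}; row_dims/col_dims select the grouped tensor dimensions."""
    def flatten(index, dims):
        flat = 0
        for d in dims:
            flat = flat * shape[d] + index[d]
        return flat

    return {
        (flatten(idx, row_dims), flatten(idx, col_dims)): value
        for idx, value in tensor.items()
    }


def unflatten(flat, sizes):
    """Invert the row-major flattening for dimensions of the given sizes."""
    index = []
    for size in reversed(sizes):
        index.append(flat % size)
        flat //= size
    return tuple(reversed(index))


def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {(row, col): value}."""
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))
    c = defaultdict(float)
    for (i, k), a_val in a.items():
        for j, b_val in b_by_row.get(k, ()):
            c[(i, j)] += a_val * b_val
    return dict(c)


def contract_via_matmul(a, a_shape, b, b_shape):
    """C[i,j,l] = sum_k A[i,j,k] * B[k,l], computed as a matrix product."""
    a_mat = tensor_to_matrix(a, a_shape, row_dims=(0, 1), col_dims=(2,))
    b_mat = tensor_to_matrix(b, b_shape, row_dims=(0,), col_dims=(1,))
    c_mat = sparse_matmul(a_mat, b_mat)
    return {
        unflatten(row, a_shape[:2]) + (col,): value
        for (row, col), value in c_mat.items()
    }


def contract_direct(a, b):
    """Reference implementation: loop over all pairs of nonzero elements."""
    c = defaultdict(float)
    for (i, j, k), a_val in a.items():
        for (k2, l), b_val in b.items():
            if k == k2:
                c[(i, j, l)] += a_val * b_val
    return dict(c)


# Small illustrative data (made up for this sketch).
a_shape, b_shape = (2, 3, 4), (4, 2)
A = {(0, 1, 2): 2.0, (1, 2, 2): 3.0, (1, 0, 3): -1.0}
B = {(2, 0): 5.0, (2, 1): 1.0, (3, 1): 4.0}

c_via_matmul = contract_via_matmul(A, a_shape, B, b_shape)
c_direct = contract_direct(A, B)
print(c_via_matmul == c_direct)
```

Note that the flattened matrix for A has 6 rows but only 4 columns, and higher-rank tensors make such shapes far more lopsided. This is one way to see why the skinny-matrix case mentioned above arises naturally in tensor reductions.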

The project will be hosted by the computational chemistry group at the University of Zurich (UZH). Software development will be done by one (at times two) postdoctoral fellows with a computer science background. The embedding in an active group of domain scientists will guarantee immediate testing and application of the library. Besides continuous testing, other tasks such as documentation and interfacing to domain-specific applications will be performed by additional local domain scientists. This group will consist of two additional postdoctoral fellows and a PhD student, who will provide part-time support. UZH will also provide the basic development systems (multi-core nodes, GPU, and Intel MIC). Large-scale testing and scaling benchmarks will be carried out, and early access to emerging technology obtained, in close collaboration with CSCS. CSCS and a possible PASC computer science team will be involved in optimization and adaptation to system software, e.g. the RMA functionality of MPI libraries. We also plan to keep in close contact with vendor scientists from Intel, Cray, and Nvidia.

The project will deliver a standalone numerical library for sparse tensor linear algebra. The library will be developed under an open source license. Development will be open and transparent, using the software hosting sites SourceForge and GitHub. The library will be disseminated through several channels. First, there will be a library-specific website providing access to the source code, testing environment, documentation, and tutorials. Second, the library will be distributed through domain libraries like ELSI (ELectronic Structure Infrastructure). Third, the library will be part of the CP2K package.

The tensor linear algebra extension of the DBCSR sparse matrix multiplication library makes the full power of modern high-performance computing hardware available to high-level programming. Domain scientists will be able to devise new algorithms for advanced electronic structure methods and implement them easily in standard software. The library will guarantee the highest single-node performance and excellent scalability to large numbers of nodes. Because the library is fully self-contained, it will be possible to design test and benchmark suites that are simple yet still meaningful for the domain. These tests can be used by computer scientists as well as hardware and system software specialists to improve system software and co-design hardware. We hope that the library will initiate improvements in newer parts of the different MPI libraries (e.g. RMA calls), in node synchronization, and in memory architectures. The wide adoption and use of the extended DBCSR library will ensure the further development of its co-libraries libxsmm and libcusmm.

Although the basic data structure using small dense blocks specifically targets electronic structure calculations with atom-centered basis sets, we believe that the extended DBCSR library can have applications in other domains as well.