Efficient QCD+QED Simulations with openQ*D software
PI: Marina Krstic Marinkovic (ETH Zurich)
Co-PIs: Isabel Campos, Nuno Cardoso, Michele Della Morte, Piotr Korcyl
July 1, 2021 – June 30, 2024
The proposed project will speed up the generation of gauge configurations in simultaneous simulations of quantum electrodynamics (QED, the quantum field theory of electromagnetism) and quantum chromodynamics (QCD, the theory of the strong interaction) with the open-source openQ*D-1.0 code, which will later be used to explore a variety of observables relevant for new physics searches on European pre-exascale supercomputers, such as the Large Unified Modern Infrastructure (LUMI) in Kajaani, Finland, and the next-generation CSCS systems. The targeted acceleration will be achieved by optimizing the relevant modules of the openQ*D-1.0 code for GPU-accelerated supercomputing platforms, with a particular focus on enabling support for GPUs (and potentially FPGAs) from various vendors.
Lattice QCD is a computationally intensive approach to solving the theory of the strong interaction. In this project both QCD and QED will be formulated on a grid of lattice points in four-dimensional space-time. Lattice QCD practitioners have been known to design and build custom computers, and consequently lattice QCD applications have traditionally been among the first to be ported to new high-performance computing architectures. The openQ*D-1.0 code, which is the target for optimization in this project, has been developed by the RC* collaboration (the PI and one of the co-PIs are among its developers). The code has recently been made available under the GNU General Public License. It is an extension of the openQCD-1.6 code for QCD, and its signature feature is the use of C* boundary conditions [16–19], which allow for a local and gauge-invariant formulation of QED in finite volume and in the charged sector of the theory. Both QCD and full QCD+QED configurations are currently being generated with the openQ*D-1.0 code at various HPC centers across Europe.
The programs for the production of gauge field ensembles and measurements are highly optimized for machines with current x86-64 processors, but will run correctly on any system that complies with the ISO C89 and MPI 1.2 standards. The code is structured to ensure very good data locality, and excellent strong and weak scaling has been documented on modern multi-CPU architectures (cf. sec. 2.1 of the proposal narrative). Nevertheless, the performance of the programs is mainly limited by data movement, i.e., the memory-to-processor bandwidth and network latency, and we plan to improve on this by further optimizing the execution on GPU-accelerated nodes. The workflow of the proposed project envisages that the new GPU code is written using architecture-independent programming models such as Kokkos and SYCL, in order to secure portability across various GPU architectures. Lessons learned by the lattice community using different discretizations of the dslash stencil operator will be a valuable input to these considerations. The openQ*D-1.0 code has a highly optimized lattice Dirac operator, which will be the first target module for GPU acceleration. The inversion of the Dirac operator constitutes the bulk of the calculation in the commonly performed configuration generation and measurement runs. The code allows for a choice of several different solvers (CGNE, MSCG, SAP+GCR, DFL+SAP+GCR); in particular, the inversions of the Dirac matrix for light quarks (the most demanding part of lattice QCD calculations) are performed in the openQ*D-1.0 code with a deflated SAP-preconditioned GCR solver.
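To illustrate the kind of inversion that dominates the run time, the following is a minimal sketch of a conjugate gradient iteration applied to the normal equations, the general idea behind solvers of the CGNE type. It is not taken from openQ*D-1.0: the function names (apply_op, dot, cg_normal_eq) are hypothetical, and a small real dense matrix stands in for the sparse complex lattice Dirac operator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the Dirac operator (an assumption of this sketch):
   y = A x, or y = A^T x when transpose != 0. A is row-major, n x n. */
static void apply_op(const double *a, size_t n, int transpose,
                     const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += (transpose ? a[j * n + i] : a[i * n + j]) * x[j];
        y[i] = s;
    }
}

static double dot(const double *u, const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += u[i] * v[i];
    return s;
}

/* Solve A x = b by running CG on A^T A x = A^T b, starting from x = 0.
   Stops when |A^T (b - A x)| <= tol * |A^T b|.
   Returns the number of iterations used (0 if b maps to zero),
   or -1 on allocation failure or if max_iter is exceeded. */
int cg_normal_eq(const double *a, size_t n, const double *b, double *x,
                 double tol, int max_iter)
{
    assert(a != NULL && b != NULL && x != NULL && n > 0);

    double *r = malloc(4 * n * sizeof *r);   /* r, p, ap, q in one block */
    if (r == NULL)
        return -1;
    double *p = r + n, *ap = p + n, *q = ap + n;
    int status = -1;

    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;
    apply_op(a, n, 1, b, r);                 /* r = A^T b */
    for (size_t i = 0; i < n; i++)
        p[i] = r[i];

    double rr = dot(r, r, n);
    const double stop = tol * tol * rr;
    if (rr == 0.0)
        status = 0;

    for (int k = 1; status < 0 && k <= max_iter; k++) {
        apply_op(a, n, 0, p, ap);            /* ap = A p */
        apply_op(a, n, 1, ap, q);            /* q = A^T A p */
        double alpha = rr / dot(ap, ap, n);  /* p^T A^T A p = |A p|^2 */
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        double rr_new = dot(r, r, n);
        if (rr_new <= stop) {
            status = k;
            break;
        }
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }

    free(r);
    return status;
}
```

Each iteration costs two operator applications plus a few vector updates and reductions, which is why accelerating the operator application and reducing data movement are the natural first targets. In a production setting the dense loop in apply_op would be replaced by the sparse stencil operator, and the reductions in dot would become global sums across MPI ranks.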
The work towards developing the innovative algorithmic and high-performance computing solutions needed to obtain precise predictions of non-perturbative phenomena studied in current high energy physics experiments will be carried out in three work packages. The first work package will kick off the project with a dedicated virtual workshop to exchange experiences between the lattice QCD and HPC communities on developing software for various GPU architectures. Continuous exchange between the lattice QCD community and HPC experts from other disciplines through conferences is envisaged. Additionally, we plan to report the final project outcomes at an in-person workshop to be organized towards the end of the funding period. The second work package will explore several strategies to reduce the execution time of the openQ*D-based programs on the new architectures, with a particular focus on finding an optimal way to port the most expensive part (in CPU time) of QCD+QED simulations to European pre-exascale supercomputers, such as the Large Unified Modern Infrastructure (LUMI) in Kajaani, Finland, or the future CSCS systems (the successor of the Piz Daint supercomputer). This work package will produce the essential input for one of the main outcomes of the project, the first Tier-0 application, which is planned for the second half of 2022. Should porting the Dirac module of the openQ*D code alone not yield a satisfactory speed-up, a risk mitigation strategy has been developed under which alternative modules of the code can be accelerated before the follow-up Tier-0 application. In parallel to the software development for GPU architectures that is the central point of this project, further algorithmic improvements will be explored within the third work package, including but not limited to further development of block Krylov solvers and an additional strategy for the parallel implementation of the molecular dynamics algorithm.
The search for efficient computational capacity in highly heterogeneous environments mixing different types of hardware requires fundamental changes to the development process of scientific codes to attain the required flexibility, scalability, reliability and reproducibility. The exploitation of technologies based on the Continuous Integration/Continuous Delivery (CI/CD) paradigm will thus be one of the core strategies for this work.