*PI: Laurent Villard (EPFL)*

*Co-PIs: Stephan Brunner (EPFL), Claudio Gheller (CSCS), Olaf Schenk (USI), Romain Teyssier (UZH), Trach-Minh Tran (EPFL)*

*July 1, 2014 - December 31, 2016*

## Project Summary

The main scientific motivation behind this proposal is the simulation of turbulence in magnetically confined plasmas from first principles. While the project is focused on this particular science goal, it is anticipated that the computational advances we intend to achieve will be of interest to our broader community. The developments will be centred on the needs of our global gyrokinetic code ORB5, but several aspects have the potential to impact other code developments too. In particular, the RAMSES code used in astrophysics, which is also based on the particle-in-cell (PIC) method, is expected to benefit from this project.

This project aims at assessing, developing, implementing, testing and adapting alternative strategies for the parallelization and optimization of the ORB5 code. This code is characterized by its particle description of phase space and by the requirement to solve consistently for the electromagnetic fields. As more and more physics is included in the code, the computational requirements are expected to increase by one or more orders of magnitude. In addition, in view of the current and expected future evolution of processor architectures and HPC systems, characterized by compute nodes with a heterogeneous structure (e.g. CPU+GPU or CPU+MIC), there is a need to investigate alternatives to the schemes now implemented in ORB5.

The bottlenecks and limitations are well known. They have both a parallelization part and an optimization part.

- The limitation on parallelization shows up for cases that combine a large number of processors with very large grid sizes. Among the communication-intensive operations are the parallel data transpose operations required to perform Fourier transforms, and global sums of large grid data.
- The limitation on processor performance comes from the particle-to/from-grid operations inherent to the PIC approach, which limit vectorization. Moreover, for very large grid sizes there is a thinning of particles in the grid, resulting in low arithmetic intensity.

The previous HP2C project had already investigated some of the parallelization aspects, with a detailed code profiling. Kernel extraction and implementation of relatively simple measures resulted in an overall performance gain of a factor of 2 for the global ORB5 code for the largest cases we could consider at the time: ITER-size simulations of ion-scale turbulence. In that project, however, the fundamental parallelization strategy was not modified: domain decomposition in the toroidal direction and domain cloning, using flat MPI. We strongly feel that more radical steps should be investigated.
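To make the particle-to-grid bottleneck concrete, here is a minimal sketch of a charge deposition step on a 1D periodic grid. It is an illustration only: the linear (cloud-in-cell) weighting, the 1D geometry, and the names `deposit_linear` and `particles_per_cell` are assumptions chosen for brevity and are not taken from ORB5.

```python
import math

def deposit_linear(positions, weights, n_cells, length):
    """Scatter particle weights onto a 1D periodic grid (linear weighting).

    Hypothetical helper for illustration; not the ORB5 deposition scheme.
    """
    dx = length / n_cells
    grid = [0.0] * n_cells
    for x, w in zip(positions, weights):
        s = (x % length) / dx            # position in units of the cell width
        i = int(math.floor(s)) % n_cells  # index of the grid point to the left
        frac = s - math.floor(s)          # fractional distance to that point
        # Indirect, data-dependent writes: the target indices depend on the
        # particle position, and several particles may hit the same cell.
        # This is what makes the loop hard to vectorize.
        grid[i] += w * (1.0 - frac)
        grid[(i + 1) % n_cells] += w * frac
    return grid

def particles_per_cell(n_particles, n_cells):
    """Average particles per cell; it drops as the grid is refined."""
    return n_particles / n_cells
```

Each particle contributes only a handful of floating-point operations per memory access, and for a fixed particle count `particles_per_cell` decreases as `n_cells` grows, which is the thinning effect described above.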

Here, we intend to investigate the following possibilities, which deal with particle-to/from-grid operations and field solver issues:

- Replace the global FFT operations (which require a transpose of large real-space grid data) and Fourier filtering with local calculations of only the Fourier modes retained by the filter. More CPU arithmetic operations will be needed; however, it is anticipated that these will be highly vectorizable.
- Replace calls to MPI reduce routines with an architecture-aware algorithm, in particular in view of the large number of cores per node.
- Try to increase the scalability of the code with an increasing number of clones, especially when this number exceeds the number of cores per node.
- Investigate the potential of accelerators (GPUs or MICs) for the basic operations involved in PIC and/or Fourier methods. This will be done first on a simplified PIC model, in order to explore the potential; in a second stage, the knowledge gained will be transferred to ORB5 and RAMSES. We note that NVIDIA professionals have already expressed their interest in contributing to GPU-related developments of PIC codes.
- Replace the finite element representation in the two periodic coordinates (poloidal, toroidal) with a Fourier representation. This implies removing particle/grid operations (except for the radial coordinate) and FFTs altogether, at the expense of more CPU operations; again, these operations will be local and therefore scalable.
- Replace the finite element representation in one of the periodic coordinates (toroidal) and apply field-aligned shape functions. Only 1D (and therefore local) FFTs will be kept (in the poloidal coordinate), for filtering purposes. This would require a significant refactoring of the code.
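Two of the options listed above rely on computing only a small set of Fourier modes with local operations. The following minimal sketch, in plain Python, shows the idea under simplifying assumptions: a single periodic coordinate, a filter given as an explicit list of mode numbers, and hypothetical function names (`selected_modes`, `deposit_to_modes`) that do not come from ORB5.

```python
import cmath
import math

def selected_modes(samples, modes):
    """Direct DFT restricted to the mode numbers kept by the filter.

    Cost is O(N * M) for N samples and M retained modes, instead of a full
    transform; each mode is an independent sum over local data.
    """
    n = len(samples)
    result = {}
    for m in modes:
        acc = 0j
        for j, f in enumerate(samples):
            acc += f * cmath.exp(-2j * math.pi * m * j / n)
        result[m] = acc
    return result

def deposit_to_modes(angles, weights, modes):
    """Project particles directly onto Fourier modes, with no grid in between.

    Each retained mode m receives sum_p w_p * exp(-i * m * theta_p).
    """
    return {
        m: sum(w * cmath.exp(-1j * m * theta) for theta, w in zip(angles, weights))
        for m in modes
    }
```

In both functions the inner sums use regular, contiguous access and have no write conflicts, which is why such operations are expected to vectorize well. The trade-off is more arithmetic per retained mode, so the approach presumably pays off only when the filter keeps a small fraction of the modes.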

These developments will certainly benefit from interaction with HPC specialists at CSCS and elsewhere, with a prospective view to new and emerging architectures. It is our clear intent to remain open to new and as yet unforeseen developments, and to be ready to adapt our strategy and tactics accordingly. It must be realized, though, that we intend to keep the ORB5 code as portable as it is now. The product we are aiming at is a code that performs well on several of the top-end platforms, rather than a code that performs even better but only on one specific platform.

The other product of this project is a set of procedures that have the potential to be applicable to other codes, beyond the specific gyrokinetic turbulence simulations of ORB5. Wherever possible and appropriate, utility libraries will be proposed and updated, with a view to facilitating the work of code developers in a wider scientific community. On this latter point, it should be mentioned that the PIs and co-PIs also benefit from integration into the Fusion Programme, with several international collaborations and competitors.