*ExaTrain: towards Exascale training for Machine Learning*

*PI: Rolf Krause (Università della Svizzera italiana)*

*Co-PI: Martin Jaggi*

*July 1, 2021 – June 30, 2024*

### Project Summary

ExaTrain aims to provide massively parallel training methods for deep neural networks. We will build a training framework that enables the parallel training of deep neural networks in both data and model space. Moreover, we will provide parallel training techniques as a proof of concept by transferring parallelization techniques from HPC to machine learning.

Nowadays, deep neural networks (DNNs) are routinely used in a wide range of application areas and scientific fields. DNNs have not only developed into a powerful tool for data analysis, e.g., for image analysis in medicine or natural language processing, but are also increasingly used in the context of numerical simulation and prediction.

The accuracy, richness, and quality of the model represented by a DNN are tightly coupled to its size. Consequently, network sizes have grown considerably over the last years, leading to networks with hundreds of billions of weights nowadays, and sizes are expected to grow further. Training these networks is arduous, and training times of weeks are not uncommon for large networks. The most common training methods are still first-order approaches, i.e., stochastic gradient methods and their accelerated variants, mostly on single GPUs. This is partially remedied by hardware acceleration, highly sophisticated implementations, and careful parameter choices. In the long run, however, this is not sustainable in terms of scalability, as the growth in computational complexity and size of DNNs outpaces the resulting gains.

Slow and unscalable training not only hinders progress on the application and research side, it also has a significant environmental impact. It is estimated that by 2030, computing and communication technology will consume between 8 and 20 percent of the world's electricity [3]. In 2018, for example, 200 terawatt-hours were consumed, almost four times the annual consumption of Switzerland, with a substantial part of this energy spent on the training of neural networks.

In order to allow for the efficient training of ever-growing networks, scalable and massively parallel training methods have to be developed and implemented. In response, and in order to deal with large models and to allow for training "at scale", parallel approaches to network training have recently been investigated more deeply. In general, both data and model parallelism can be considered, i.e., a decomposition of the sample space or of the weight space, respectively. Frameworks such as DeepSpeed implement variants of both approaches. They are, however, based on first-order methods, as the straightforward use of second-order approaches is hampered by the size of the networks and by the massive parallelism this would require. Thus, they allow for a significant increase in speed, but by design they cannot provide a scalable approach to training. A notable step forward is the XBraid package, which implements layer parallelism and a certain multilevel approach. Ultimately, any algorithmic software needs to be provided in such a way that its performance is easily accessible to the AI user community, which is used to intuitive frameworks such as PyTorch or TensorFlow.
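To make the distinction between the two decompositions concrete, the following toy sketch (plain NumPy, not ExaTrain code) computes one gradient of a linear least-squares model under both a sample-space and a weight-space split; the two-worker partition and all names are purely illustrative.

```python
import numpy as np

# Toy illustration: data vs. model parallelism for one gradient of a
# linear least-squares model y ~ X @ w. Two "workers" are simulated
# sequentially; in practice the shards live on different devices.

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 weights
y = rng.normal(size=8)
w = np.zeros(4)

def grad(Xb, yb, wb):
    """Gradient of 0.5 * ||Xb @ wb - yb||^2 with respect to wb."""
    return Xb.T @ (Xb @ wb - yb)

# Data parallelism: split the *sample* space across two workers.
# Each worker computes the gradient on its shard; an all-reduce sums them.
g_data = sum(grad(Xs, ys, w)
             for Xs, ys in zip(np.split(X, 2), np.split(y, 2)))

# Model parallelism: split the *weight* space across two workers.
# Each worker owns a block of w (and the matching columns of X) and
# computes only its block of the gradient; blocks are concatenated.
residual = X @ w - y                 # requires communication in practice
g_model = np.concatenate([Xc.T @ residual
                          for Xc in np.split(X, 2, axis=1)])

# Both decompositions reproduce the serial gradient exactly.
g_serial = grad(X, y, w)
assert np.allclose(g_data, g_serial)
assert np.allclose(g_model, g_serial)
```

The point of the sketch is that the two strategies differ in what must be communicated: partial gradients in the data-parallel case, activations/residuals in the model-parallel case.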

ExaTrain addresses these points by, firstly, providing an accessible framework for implementing parallel training strategies to the AI community, and, secondly, porting scalable strategies from domain decomposition and multigrid to machine learning as a first realization of upcoming parallel training strategies.

ExaTrain will implement the necessary building blocks for massively parallel training strategies as an "add-on" to existing frameworks for neural networks. In this way, ExaTrain will not only allow us to transfer the scalability and efficiency of established methods from HPC to the training of deep networks in machine learning; it will also allow for realizing new and as yet unknown training strategies, whether they are based on existing domain decomposition methods or not.

From the HPC side it is well known that a natural framework for large-scale parallel problems can be found in domain decomposition (DD) and multigrid (MG) methods. Originally developed for the scalable solution of elliptic PDEs, or, equivalently, of convex minimization problems, they have been extended over the last decades to a wide class of problems. This includes in particular the minimization of non-convex and non-smooth functions, for which scalable DD and MG methods such as RMTR, G-ASPIN, time-parallel approaches, and MG/OPT are available nowadays. Recently, these methods have been applied highly successfully to the training of neural networks, exploiting the fact that training is in fact a non-convex minimization problem, namely the minimization of a loss function.
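As a rough illustration of the multilevel idea, the sketch below performs MG/OPT-style two-level steps on a toy quadratic "loss": a coarse model with a first-order consistency ("tau") correction is approximately solved, the correction is prolongated back, and a fine-level smoothing step follows. The restriction, step sizes, and coarse model are illustrative choices only; the cited methods (RMTR, MG/OPT) are considerably more elaborate, with trust regions and V-cycles.

```python
import numpy as np

# Schematic two-level, MG/OPT-style training step on a toy objective.
# A convex quadratic is used so the effect is easy to verify; the same
# skeleton applies to non-convex losses.

n = 8
rng = np.random.default_rng(1)
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)              # SPD Hessian of the toy loss
b = rng.normal(size=n)

f = lambda w: 0.5 * w @ A @ w - b @ w    # fine-level objective
grad = lambda w: A @ w - b

# Restriction by pairwise averaging; prolongation is its transpose.
R = np.kron(np.eye(n // 2), np.full((1, 2), 0.5))
P = R.T
AH = R @ A @ P                           # Galerkin coarse operator
gradH = lambda wH: AH @ wH               # gradient of the coarse model

def two_level_step(w, lr=0.01, coarse_iters=20):
    """Coarse correction with first-order consistency, then smoothing."""
    wH0 = R @ w
    # "tau" correction: makes the coarse gradient at wH0 match R @ grad(w)
    tau = R @ grad(w) - gradH(wH0)
    wH = wH0.copy()
    for _ in range(coarse_iters):        # approximate coarse solve
        wH = wH - lr * (gradH(wH) + tau)
    w = w + P @ (wH - wH0)               # prolongate the coarse correction
    return w - lr * grad(w)              # fine-level smoothing step

w = np.zeros(n)
for _ in range(30):
    w = two_level_step(w)

assert f(w) < f(np.zeros(n))             # the training loss decreased
```

With `P = R.T` and the Galerkin coarse operator, every descent step on the corrected coarse model corresponds exactly to a descent step on the fine objective restricted to the coarse subspace, which is the mechanism behind the algorithmic scalability mentioned in the text.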

We emphasize that, in order to achieve scalability, a simple distribution of the network and the computational tasks is not sufficient. In fact, large-scale training requires a combination of algorithmic scalability with data and model parallelism. Multilevel decompositions are not only known to provide this algorithmic scalability; they also serve as a preconditioner for ill-conditioned datasets, or, in the stochastic setting, as a variance reduction method, thereby increasing the robustness and speed of the training process.

The two prospective PASC postdoctoral researchers in ExaTrain will collaborate tightly with the postdoctoral researcher and two PhD students employed on the recent SNF project "ML2 – Multilevel and Domain Decomposition methods for Machine Learning". While the researchers in ML2 will focus more on algorithmic and methodological aspects, the PASC researchers will concentrate on the development of the software framework, on data management, and on communication aspects, aiming at an open and flexible framework for parallel training.

On the software side, we will use the publicly available library Utopia, which is developed by the applicants at ICS in Lugano together with CSCS. Furthermore, ExaTrain will profit from a cooperation with the authors of the XBraid package. The modular structure of Utopia will serve as the basis for implementing general decomposition approaches in data as well as in model space. For the latter in particular, no libraries are currently available that would allow for massive parallelism and for harnessing the accumulated power of future GPU-based supercomputers for training.

In ExaTrain, we will use a Python-extended version of Utopia, i.e., UtoPya, as an interface to existing frameworks such as PyTorch, thereby combining the efficiency and functionality of established AI tools with the new abstract DD functionality into our new ExaTrain framework. We will follow an approach similar to that of DeepSpeed: the model is defined as in PyTorch, and possible additional decomposition parameters are added to the definition. Then UtoPya is called, which acts as a DD interface for PyTorch and passes any model parameters on to the PyTorch instances.
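Since UtoPya is still to be developed, the following pure-Python mock only illustrates the intended usage pattern described above; every name in it (`DDConfig`, `parallel_train`) is hypothetical, and a real implementation would wrap `torch.nn.Module` and dispatch shards to distributed PyTorch instances rather than loop sequentially.

```python
# Hypothetical sketch of the intended UtoPya-style usage pattern.
# None of these names exist yet; plain Python stands in for PyTorch
# so that only the control flow is visible: the user defines the
# training step as usual and merely attaches decomposition parameters.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DDConfig:
    """Decomposition parameters added to the model definition."""
    data_shards: int = 2      # split of the sample space
    model_shards: int = 1     # split of the weight space (unused here)

def parallel_train(step: Callable[[List[float]], List[float]],
                   batches: List[List[float]],
                   cfg: DDConfig) -> List[List[List[float]]]:
    """Stand-in for the DD driver: route each shard of every batch to
    a (here: sequential) worker running the user's training step."""
    results = []
    for batch in batches:
        k = cfg.data_shards
        shards = [batch[i::k] for i in range(k)]   # sample-space split
        results.append([step(s) for s in shards])
    return results

# User-side code stays framework-like: define the step, add a config.
cfg = DDConfig(data_shards=2)
out = parallel_train(lambda s: [2 * x for x in s], [[1, 2, 3, 4]], cfg)
assert out == [[[2, 6], [4, 8]]]
```

The design point is that the decomposition lives entirely in the configuration object, so the user's model and training step remain unchanged when the parallel layout changes.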

A central part of ExaTrain will be a tight cooperation with the Department of Radiology of the EOC at the cantonal hospital of Ticino. As a challenging use case for our parallel training methods, we will test and improve our methods and implementation on radiological datasets of fractured bones. Additionally, standard datasets will be used for testing and validation.

The ExaTrain framework will be open for user defined or individually adapted strategies and will be made available to the community.