PI: Torsten Hoefler (ETHZ)
January 1, 2014 - December 31, 2016
High-performance and scientific computing has always been at the forefront of adopting new architectures; for example, it was among the first fields to adopt vectorization and multicore computing. Using those features required expert programmers who were able to write (or refactor existing code into) vectorizable code, perform manual low-level vectorization using compiler intrinsics, or write multithreaded applications. This high entry barrier often led to scientific codes that left a huge optimization potential on the table. Over time, compiler technology improved to support these technologies through auto-vectorization, auto-parallelization, and language extensions or annotations such as OpenMP and OpenACC.

Quickly growing power and energy requirements of compute chips have led to dramatic changes in architectures. In fact, energy is becoming more important than performance: the operational cost of a server over its lifetime already exceeds its purchase cost, and power consumption is the main limiting factor for very large-scale systems. Hardware specialization is currently the most effective strategy for energy optimization, and two main modes of specialization exist today: latency-optimized cores (and memory) and throughput-optimized cores (and memory). The former are needed for applications with little parallelism along the critical path, while the latter enable highly efficient processing of applications with vast parallelism. Naturally, the energy overhead in hardware is much higher for the first category due to (energy-)expensive hardware caches, prefetching, and speculative execution. However, many scientific applications have large computational parts that map very well to the second category.
Programming such systems today often involves proprietary languages (such as CUDA) or libraries that may not deliver the needed performance or ease of use (e.g., OpenCL requires more than a hundred lines of code just for initialization). OpenACC (or OpenMP 4.0) aims to address the programmer's needs by specifying pragmas for offloading parts of the computation to accelerator devices. However, all of these models support only a CPU-centric programming model in which various parts of the computation are offloaded. Some systems may not have CPUs at all, or only very weak CPUs that mostly act as communication co-processors; other systems have CPUs that are nearly as powerful as the co-processors themselves. Future programming models and compilation systems should be oblivious to the details of a specific architecture and instead identify the two types of code so that each can be assigned to the best-suited execution unit.

This project aims to develop techniques for the compilation of scientific codes to various target architectures such as GPUs, co-processors (e.g., Xeon Phi), and multicore CPUs. A focal point is to identify code regions that can be mapped to high-throughput cores and others that need low-latency cores. The project will employ static and dynamic performance models and optimization strategies to identify such code pieces. In addition, we plan to investigate optimizations for the target accelerator architectures. The complex memory subsystem of today's GPU architectures, in which bank conflicts and partition camping can reduce performance 100-fold, requires special attention. Since the required precision is often hard to achieve in general compiler analyses, we also investigate annotations (e.g., pragmas) that drive the optimizing transformations, as well as domain-specific languages that limit the program semantics to a transformable subset.
All developments will be based on the LLVM compiler framework which provides front-ends for a variety of programming languages (e.g., C/C++ and Fortran) and back-ends for various architectures including x86 and NVIDIA GPUs (PTX). LLVM is a production-quality compiler suite that ensures the usability and maintainability of our software artifacts. For example, the PTX backend is developed by NVIDIA itself. The specific goals of this project are:
The project aims to strengthen the knowledge about low-level runtime and accelerator compilation systems in Switzerland. Both techniques are key to developing a strong competence in high-performance computing for the PASC community. Mastering such techniques can lead to significant improvements in energy efficiency, cost-effectiveness, and general scientific competitiveness.