This is a PhD thesis written by Bernhard Manfred Gruber (ORCID: 0000-0001-7848-1690) to obtain the academic degree Doktoringenieur (Dr.-Ing.) at the Technical University Dresden. It was submitted on 2024-08-20, defended on 2025-04-17, and published on 2025-10-20.
This repository contains all LaTeX sources to build the thesis, as well as all data and scripts to produce the included plots. Furthermore, the slide decks used for the status talk (a kind of pre-defence) and for the final public defence are included. The final rendered document is available in two versions: a digital and a printed version.
The published thesis is also available on the Qucosa server of the Saxon State and University Library Dresden (SLUB): https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-989028
If not indicated otherwise, all materials published in this repository are licensed under the Creative Commons Attribution 4.0 International License.
Efficient parallel programs increasingly rely on memory-related optimizations and on respect for the target hardware's internal structure. This presents a challenge for portable codes running on a variety of architectures. A single-source approach is highly desirable, ideally one that retains full control over target-specific optimizations.
Memory-related optimizations are manifold, and generally require full control over data layout, memory access, storage format, memory allocation, and physical memory location. These aspects are ideally decoupled from a program and its data structures, and unified into a coherent zero-overhead abstraction layer.
By abstracting multidimensional arrays of nested structures, the foundation of data structure design, as indexable spaces, portable programs can be written against a generic interface. The low-level abstraction of memory access (LLAMA) implements this concept as a C++ abstraction library, underneath which every performance-relevant aspect can be customized with minimal effort and without any change to user code.
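The idea of writing user code against an indexable space while swapping the data layout underneath can be illustrated with a minimal sketch. It deliberately does not use LLAMA's actual API: the names `Field`, `AoS`, `SoA`, `View`, `fillMasses`, and `totalMass` are hypothetical and chosen only for illustration, and the record is reduced to three `float` fields in a one-dimensional array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical field tags of a small "particle" record (not LLAMA's API).
enum Field : std::size_t { PosX, PosY, Mass, FieldCount };

// Array-of-structs mapping: the fields of one record are adjacent in memory.
struct AoS {
    static std::size_t slot(std::size_t i, Field f, std::size_t /*n*/) {
        return i * FieldCount + f;
    }
};

// Struct-of-arrays mapping: all values of one field are adjacent in memory.
struct SoA {
    static std::size_t slot(std::size_t i, Field f, std::size_t n) {
        return f * n + i;
    }
};

// A view over n records. The Mapping policy alone decides where the value
// for a given (record index, field) pair is stored.
template <typename Mapping>
class View {
public:
    explicit View(std::size_t n) : n_(n), storage_(n * FieldCount) {}

    float& operator()(std::size_t i, Field f) {
        assert(i < n_ && f < FieldCount);
        return storage_[Mapping::slot(i, f, n_)];
    }

    std::size_t size() const { return n_; }

private:
    std::size_t n_;
    std::vector<float> storage_;
};

// User code, written once against the generic (index, field) interface.
template <typename ParticleView>
void fillMasses(ParticleView& particles) {
    for (std::size_t i = 0; i < particles.size(); ++i)
        particles(i, Mass) = static_cast<float>(i + 1);
}

template <typename ParticleView>
float totalMass(ParticleView& particles) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i)
        sum += particles(i, Mass);
    return sum;
}
```

Instantiating `View<AoS>` or `View<SoA>` changes the memory layout while `fillMasses` and `totalMass` stay untouched, which is the kind of decoupling the paragraph above describes. A real library additionally has to handle nested records, multiple array dimensions, and heterogeneous field types.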
LLAMA shows no overhead in most analyzed code bases, including real-world software, and generally produces machine code equivalent to manual data layout or SIMD implementations, while running portably on all relevant contemporary hardware architectures. LLAMA's abstraction offers a solid foundation for systematic optimization, including instrumentation, profiling, and rapid data layout exploration.
LLAMA shows that a unification of existing memory optimization approaches is entirely possible, while making no compromises on the portability of code and supported hardware platforms, providing a novel tool for the development of high-performance C++ applications in a heterogeneous environment.