Vector length agnostic SIMD parallelism on modern processor architectures with the focus on Arm's SVE / Author: Bine Brank. Wuppertal, April 2023
Table of contents
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations
Symbols
1 Introduction
1.1 Motivation
1.2 Single Instruction Multiple Data
1.3 Research goals
2 Background
2.1 SVE
2.1.1 Architectural state
2.1.2 Main features
2.1.2.1 Predication
2.1.2.2 Gather-load & Scatter-store
2.1.2.3 Reduction operations
2.1.2.4 Other
2.1.3 Code generation
2.1.3.1 Assembly
2.1.3.2 Intrinsic code
2.1.3.3 Compiler support
2.2 Applications
2.2.1 Benchmarks
2.2.2 OpenBLAS
2.2.3 GROMACS
2.2.4 GPAW
2.2.5 MiniFE
2.3 Hardware
2.3.1 Neoverse N1
2.3.1.1 Graviton 2
2.3.2 A64FX
2.4 Tools
2.5 Gem5 simulator
2.6 Related Work
3 Methodology
3.1 Auto-vectorization
3.2 Application setup
3.2.1 Computational patterns
3.2.2 Application hot spots
3.2.3 Benchmark preparation
3.2.4 Region Of Interest
3.3 Gem5 model
3.3.1 O3CPU
3.3.2 Core configuration
3.3.3 Cache configuration
3.3.4 Memory configuration
3.4 SVE static analysis
3.5 Architectural exploration
3.5.1 Design space
3.5.2 Microarchitectural analysis
3.5.3 Typical workflow
4 Porting of applications
4.1 OpenBLAS
4.1.1 BLAS3 general algorithm
4.1.2 Preserving the VLA feature
4.1.3 SVE assembly kernel
4.1.4 SVE intrinsic packing functions
4.1.5 Triangular matrices
4.2 GROMACS
4.2.1 Compute patterns
4.2.2 SIMD backend
4.2.3 SVE specific code
4.3 GPAW
4.3.1 Application hot-spots
4.3.2 Selected GPAW kernels
4.3.2.1 Laplace operator (bmgs_fd)
4.3.2.2 Electron density (construct_density)
4.3.3 Benchmark extraction
4.4 MiniFE
4.4.1 Application characteristics
4.4.2 Sparse Matrix-Vector multiplication
4.4.3 Sliced ELLpack and SELL-C-σ
4.4.4 Intrinsic implementation (VLA SELL)
5 Results & analysis
5.1 SVE ISA exploitation
5.1.1 Auto-vectorization
5.1.2 Potential ISA extension
5.2 Validation of the Gem5 model
5.2.1 STREAM
5.2.2 Tinymembench
5.2.3 NAS parallel benchmarks
5.3 Selected applications
5.3.1 OpenBLAS
5.3.1.1 DGEMM benchmark
5.3.2 GROMACS
5.3.2.1 Nonbonded benchmark
5.3.2.2 Ribonuclease
5.3.3 GPAW
5.3.3.1 Function Bmgs_fd
5.3.3.2 Function Construct_density
5.3.3.3 A64FX behavior
5.3.4 MiniFE
5.3.4.1 SpMVM
6 Summary & conclusions
A SVE examples
A.1 DAXPY
A.2 Array permutation
A.3 Sum reduction
A.4 Complex arithmetics
A.5 strlen function
B Gem5 configuration
B.1 Execution units
B.2 Caches
Bibliography