The Scalasca performance toolset architecture
By M. Geimer, F. Wolf, B.J.N. Wylie, E. Ábrahám, D. Becker, B. Mohr.
Published in Concurrency and Computation: Practice and Experience, 22(6):702-719, April 2010.
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers grows from generation to generation. As a consequence, supercomputing applications must harness much higher degrees of parallelism to satisfy their enormous demand for computing power.
IBM BlueGene/P in Jülich with 294,912 cores.
However, with today’s leadership systems featuring more than a hundred thousand cores, writing efficient codes that exploit all the available parallelism becomes increasingly difficult. Performance optimization is therefore expected to become an even more essential software-process activity, critical to the success of many simulation projects. The situation is exacerbated by the fact that the rising number of cores imposes scalability demands not only on applications but also on the software tools needed for their development.
Efforts to make applications run efficiently at larger scales are often thwarted by excessive communication and synchronization overheads. Especially during simulations of irregular and dynamic domains, these overheads are often aggravated by wait states that appear in the wake of load or communication imbalance, when processes fail to reach synchronization points simultaneously. In particular, when trying to scale communication-intensive applications to large processor counts, such wait states can result in substantial performance degradation.
To address these challenges, Scalasca has been designed as a diagnostic tool to support application optimization on highly scalable systems. Although it also covers single-node performance via hardware-counter measurements, Scalasca mainly targets communication and synchronization issues, an understanding of which is critical to scaling applications to performance levels in the petaflops range. A distinctive feature of Scalasca is its ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.
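To make the notion of a wait state concrete, the following minimal sketch models a single barrier: every process idles from its own arrival until the slowest process arrives. This is a simplified illustration under stated assumptions, not Scalasca's actual analysis; the function name `barrier_wait_times` and the arrival times are hypothetical.

```python
def barrier_wait_times(arrival_times):
    """Return the idle time of each process at one barrier.

    Simplifying assumption: the barrier releases all processes at the
    moment the last one arrives, so each process waits for
    (latest arrival - its own arrival).
    """
    release = max(arrival_times)
    return [release - t for t in arrival_times]


# Hypothetical arrival times in seconds for four processes with an
# unevenly distributed workload (process 2 has the most work).
arrivals = [1.0, 1.5, 4.0, 2.0]

waits = barrier_wait_times(arrivals)
total_wait = sum(waits)

for rank, wait in enumerate(waits):
    print(f"process {rank}: waited {wait:.1f} s")
print(f"total wait time: {total_wait:.1f} s")
```

In this toy model only the most heavily loaded process never waits. The summed idle time of the others is the cost of the imbalance, and it tends to grow as more processes take part in the synchronization.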