Research


Our research activities aim at further improving both the functionality and the scalability of Scalasca. In addition to keeping pace with rapid developments in parallel computing, such as accelerators and partitioned global address space languages, we also conduct research to expand the general understanding of parallel performance in simulation codes. The examples below summarize two ongoing projects aimed at increasing the expressive power of the analyses Scalasca supports.

Time-series call-path profiling


Since scientific parallel applications simulate the temporal evolution of a system, their progress occurs at discrete points in time. Accordingly, the core of such an application is typically a loop that advances the simulated time step by step. The performance behavior, however, may vary between individual iterations, for example due to periodically recurring extra activities or because the state of the computation adjusts to new conditions in so-called adaptive codes [1]. The figure below shows how much key performance metrics such as point-to-point communication time may differ both across space and time.




Distribution of point-to-point communication time in the SPEC MPI2007 application 129.tera_tf across the process-iteration space. The more time spent on message communication, the more reddish the color.
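To make the idea of per-iteration behavior concrete, the following minimal sketch models a time-stepping main loop whose cost varies between iterations. The `begin_iteration` hook and the profile structure are invented for illustration; they are not Scalasca's actual instrumentation API.

```python
# Illustrative sketch (not Scalasca's API): a time-stepping solver whose
# per-iteration cost varies, with a hypothetical hook that lets a profiler
# attribute measurements to individual iterations.

def begin_iteration(profile, step):
    """Hypothetical hook: open a new per-iteration measurement bucket."""
    profile.append({"step": step, "comp_time": 0.0, "comm_time": 0.0})

def simulate(num_steps):
    profile = []
    for step in range(num_steps):
        begin_iteration(profile, step)
        profile[-1]["comp_time"] = 1.0        # regular work in every step
        if step % 10 == 0:                    # periodically recurring extra
            profile[-1]["comp_time"] += 0.5   # activity, e.g. checkpointing
        profile[-1]["comm_time"] = 0.2
    return profile

prof = simulate(25)
# Iterations 0, 10, and 20 show the periodic extra computation time,
# invisible in a profile aggregated over the whole run.
```

Aggregating over all 25 iterations would average the spikes away, which is precisely why per-iteration profiles are needed to expose such patterns.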


To study such time-dependent behavior, Scalasca is being extended to distinguish individual iterations in profiles and event traces. However, even generating call-path profiles (as opposed to traces) separately for thousands of iterations may exceed the available buffer space, especially when the call tree is large and more than one metric is collected. For this reason, we developed a runtime approach for the semantic compression of call-path profiles, based on incremental clustering of a series of single-iteration profiles, that scales in terms of the number of iterations without sacrificing important performance details [2]. This method, which will be integrated into future versions of Scalasca, keeps the runtime overhead low by using only a condensed version of the profile data when calculating distances, and it accounts for process-dependent variations by making all clustering decisions locally.
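The core idea of incremental, capacity-bounded clustering can be sketched as follows. Each single-iteration profile is condensed to a small metric vector; whenever the cluster budget is exceeded, the two closest clusters are merged so memory stays bounded in the number of iterations. The data structures and the Euclidean distance here are our own simplifications, not the published algorithm's exact design.

```python
# Sketch of incremental clustering of condensed single-iteration profiles.
# Each cluster keeps a centroid, a member count, and the iteration numbers
# it represents; exceeding the budget triggers a merge of the closest pair.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def merge(c1, c2):
    n = c1["count"] + c2["count"]
    centroid = tuple((x * c1["count"] + y * c2["count"]) / n
                     for x, y in zip(c1["centroid"], c2["centroid"]))
    return {"centroid": centroid, "count": n,
            "members": c1["members"] + c2["members"]}

def add_iteration(clusters, vector, step, budget):
    clusters.append({"centroid": vector, "count": 1, "members": [step]})
    if len(clusters) > budget:
        # merge the closest pair so memory use stays bounded
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: distance(clusters[p[0]]["centroid"],
                                          clusters[p[1]]["centroid"]))
        merged = merge(clusters[i], clusters[j])
        clusters[:] = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

clusters = []
for step, vec in enumerate([(1.0, 0.2), (1.0, 0.21), (1.5, 0.2), (1.01, 0.2)]):
    add_iteration(clusters, vec, step, budget=2)
# The three near-identical iterations collapse into one cluster;
# the outlier iteration (1.5, 0.2) remains separate.
```

Because each process clusters its own iteration series independently, no communication is needed at runtime, which is what makes the per-process ("local") decision strategy attractive at scale.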


References
[1]

Zoltán Szebenyi, Brian J. N. Wylie, Felix Wolf: SCALASCA Parallel Performance Analyses of SPEC MPI2007 Applications. In Proc. of the 1st SPEC International Performance Evaluation Workshop (SIPEW), volume 5119 of Lecture Notes in Computer Science, pages 99–123, Darmstadt, Germany, Springer, June 2008.

[2]

Zoltán Szebenyi, Felix Wolf, Brian J. N. Wylie: Space-Efficient Time-Series Call-Path Profiling of Parallel Applications. In Proc. of the ACM/IEEE Conference on Supercomputing (SC09), Portland, Oregon, ACM, November 2009.


Identifying the root causes of wait-state formation


In general, the temporal or spatial distance between the cause of a performance problem and its symptom constitutes a major difficulty in deriving helpful conclusions from performance data. Merely knowing the locations of wait states in the program is therefore often insufficient to understand the reason for their occurrence. Building on earlier work by Meira Jr. et al. [1], the replay-based wait-state analysis is currently being extended so that it attributes waiting times to their root causes [2], as exemplified in the figure below. Typically, these root causes are intervals during which a process performs some additional activity that its peers do not perform, for example as a result of insufficiently balanced load.

In a subroutine of the Zeus MP/2 astrophysics code, several processes, primarily arranged on a small hollow sphere within the virtual process topology (shown on the left), are responsible for wait states arranged on the enclosing hollow sphere (shown on the right). Since the inner region of the topology carries more load than the outer region, processes at the rim of the inner region delay those farther outside.
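The principle of mapping a wait state back onto the excess load that caused it can be illustrated with a toy late-sender scenario: a receiver blocks in a receive until its peer reaches the matching send, so any extra work on the sender reappears one-to-one as waiting time on the receiver. All numbers below are invented for illustration.

```python
# Toy illustration of root-cause attribution for a late-sender wait state.
# The receiver's waiting time equals the sender's delay, i.e. the extra
# computation its peer did not perform.

def late_sender_wait(recv_enter, send_enter):
    # The blocking receive waits from its entry until the matching send
    # is issued; a send issued earlier causes no waiting at all.
    return max(0.0, send_enter - recv_enter)

base_work = 2.0        # computation both processes perform
extra_work = 0.7       # excess load only on the sender: the root cause

send_enter = base_work + extra_work   # sender reaches the send at t = 2.7
recv_enter = base_work                # receiver reaches the receive at t = 2.0

wait = late_sender_wait(recv_enter, send_enter)
# wait == 0.7: the waiting time maps exactly onto the sender's excess load,
# so the analysis reports the delay, not the receive, as the cause.
```

In real applications the mapping is less direct, because delays propagate through chains of messages before surfacing as wait states elsewhere, which is why the attribution is performed during a parallel replay of the event trace.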


However, excess workload identified as the root cause of wait states usually cannot simply be removed. To achieve a better balance, optimization hypotheses drawn from such an analysis typically propose redistributing the excess load to other processes instead. Unfortunately, redistributing workloads in complex message-passing applications can have intricate side effects that may compromise the expected reduction of waiting times. Given that balancing the load statically or even introducing a dynamic load-balancing scheme constitutes a major code change, it should ideally be undertaken only if the prospective performance gain is likely to materialize. Recent work [3] therefore concentrated on automatically predicting the effects of redistributing a given delay in a scalable way, without altering the application itself, and on determining the savings one can realistically hope for. Since the effects of such changes are hard to quantify analytically, they are simulated via a real-time replay of event traces after the traces have been modified to reflect the redistributed load.
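A drastically simplified replay can show both the benefit and the side effects of such a redistribution hypothesis. The sketch below models a three-process pipeline in which each process finishes its local work and then forwards a message to its successor; shifting load between processes changes the predicted waiting times without touching the "application" itself. The model and numbers are our own simplification, not the actual trace-replay machinery.

```python
# Sketch of predicting the effect of load redistribution by re-simulating
# a message pipeline P0 -> P1 -> P2 instead of modifying the application.

def replay(work):
    """Replay a simple pipeline: each process starts at t = 0, finishes its
    local work, but cannot forward before receiving from its predecessor."""
    finish = 0.0   # time at which the incoming message becomes available
    waits = []
    for w in work:
        ready = w                       # local completion time of this process
        waits.append(max(0.0, finish - ready))
        finish = max(finish, ready)     # message leaves at the later of the two
    return waits

original = replay([3.0, 1.0, 1.0])     # P0 carries 2.0 units of excess load
balanced = replay([2.0, 2.0, 1.0])     # hypothesis: shift 1.0 unit to P1
# original -> [0.0, 2.0, 2.0], balanced -> [0.0, 0.0, 1.0]: the shift removes
# P1's wait entirely, but P2 still waits, so the total saving is smaller than
# a per-link view would suggest.
```

Even this toy model shows why the savings must be determined by replaying the whole trace: the effect of moving load at one point propagates through subsequent communication and is not simply additive.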


References
[1]

Wagner Meira Jr., Thomas J. LeBlanc, Virgílio A. F. Almeida: Using cause-effect analysis to understand the performance of distributed programs. In SPDT '98: Proceedings of the SIGMETRICS symposium on Parallel and distributed tools, pages 101–111, New York, NY, USA, ACM, 1998.

[2]

David Böhme, Markus Geimer, Felix Wolf, Lukas Arnold: Identifying the root causes of wait states in large-scale parallel applications. In Proceedings of the 39th International Conference on Parallel Processing (ICPP), San Diego, CA, September 2010.

[3]

Marc-André Hermanns, Markus Geimer, Felix Wolf, Brian J. N. Wylie: Verifying Causality Between Distant Performance Phenomena in Large-Scale MPI Applications. In Proc. of the 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pages 78–84, Weimar, Germany, IEEE Computer Society, February 2009.