Benchmarking a software system aims at determining software product metrics, for example to make systems comparable or to demonstrate performance improvements. Like scientific results, benchmarking results should be repeatable and reproducible. Repeatability requires that repeating a benchmark in the same setup leads to (statistically) equivalent results [3, 5]. Reproducibility requires that a repetition by a third party (in the same or a similar setup) leads to (statistically) equivalent results. However, the prerequisites for successfully reproducing benchmarking results, in particular those of microbenchmarks, are currently not a subject of research.
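To make "(statistically) equivalent" more concrete, the following minimal sketch compares two benchmark runs by checking whether approximate 95% confidence intervals of their mean response times overlap. This is only one simple reading of statistical equivalence and not necessarily the criterion used in the cited work. The function names and the sample values are hypothetical, and the interval uses a normal approximation (z = 1.96), which is rough for small samples.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (low, high) of an approximate 95% confidence interval for the mean."""
    mean = statistics.mean(samples)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * sem, mean + z * sem


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def roughly_equivalent(run_a, run_b):
    """Heuristic check: do the confidence intervals of both runs overlap?"""
    return intervals_overlap(
        mean_confidence_interval(run_a),
        mean_confidence_interval(run_b),
    )


# Hypothetical response times in milliseconds (illustrative values only).
original = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 9.8, 10.1]
repetition = [10.2, 10.0, 10.3, 9.9, 10.1, 10.2, 10.0, 10.3]
slower = [12.1, 12.4, 11.9, 12.2, 12.0, 12.3, 12.1, 12.2]

print(roughly_equivalent(original, repetition))
print(roughly_equivalent(original, slower))
```

Overlapping intervals are a coarse heuristic. A formal equivalence test or a comparison of whole distributions may be more appropriate when response times are skewed or fluctuate over time.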
As part of its work, the Software Systems Engineering group performs benchmarking in different setups, e.g., [1, 3, 4]. On this page, we describe our experiments in detail so that reproducing them is supported or becomes possible. We see this as our (current) contribution to reproducibility and will update this page as more recent results become available.
Performance experiments on SPASS-meter:
- SPEC-JVM experiments
- Experiments on response-time fluctuations based on Waller [2]
[1] H. Eichelberger and K. Schmid. “Flexible resource monitoring of Java programs”. In: J. Syst. Softw. 93 (2014), pp. 163–186.
[2] J. Waller. “Performance Benchmarking of Application Monitoring Frameworks”. PhD thesis. University of Kiel, 2014.
[3] A. Sass. “Performanz-Optimierung des Ressourcen-Messwerkzeugs SPASS-meter”. MSc thesis. University of Hildesheim, 2016.
[4] H. Eichelberger, A. Sass, and K. Schmid. “From Reproducibility Problems to Improvements: A journey”. In: Symposium on Software Performance (SSP'16), 2016 (to appear).
[5] K. Kanoun, Y. Crouzet, A. Kalakech, and A.-L. Rugina. “Windows and Linux Robustness Benchmarks with Respect to Application Erroneous Behavior”. In: Dependability Benchmarking for Computer Systems. 2008, pp. 227–254.