Reproducible Machine Learning
Module Description
Course | Module Abbreviation | Credit Points
Bachelor CL | AS-CL | 8 LP
Master CL | SS-CL, SS-TAC | 8 LP
Seminar Informatik | BA + MA | 4 LP
Anwendungsgebiet Informatik | MA | 8 LP
Anwendungsgebiet SciComp | MA | 8 LP

Lecturer | Stefan Riezler
Module Type | Seminar
Language | English
First Session | 16.04.2024
Time and Place | Tuesday, 14:15 - 15:45, Mathematikon SR11
Commitment Period | tbd.
Participants
Advanced Bachelor students and all Master students. Students from Computer Science or Scientific Computing, especially those with the application area Computational Linguistics, are welcome.
Prerequisite for Participation
Good knowledge of statistical machine learning and experience in experimental work.
Assessment
- Regular and active participation (discussion of presented papers during seminar sessions)
- Oral presentation (30min presentation + 15min discussion, commitment for presentation by April 23, 2024, by email stating 3 ranked preferences)
- Implementation project and written report (required for 8 LP) or written term paper (required for 4 LP) (5 pages, accompanied by a signed declaration of independent authorship; deadline: end of semester)
Content
Reproducibility of experimental results is one of the fundamental pillars of scientific research. If a replication of an experiment yields neither a reliable nor a significant evaluation result, the whole methodological foundation of the research becomes questionable, casting doubt on the validity of its findings.
In this seminar we will learn about several sources of nondeterminism that hamper reproducibility, and about statistical reliability and significance tests that allow us to analyze the inferential reproducibility of machine learning research. Instead of removing all sources of measurement noise, we will treat certain types of variance as irreducible conditions of measurement and analyze their interaction with data properties, with the aim of drawing inferences beyond particular instances of trained models.
We will show how to incorporate meta-parameter variations and data properties into statistical significance testing with Generalized Likelihood Ratio Tests (GLRTs), how to use variance component analysis based on Linear Mixed Effects Models (LMEMs) to analyze the contribution of noise sources to overall variance, and how to compute a reliability coefficient as an indicator of reproducibility.
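To make these ideas concrete, here is a minimal sketch (not part of the official course materials) of how variance component analysis, the intra-class correlation coefficient, and a likelihood-ratio test of a system effect can be set up in Python with statsmodels. The data, variable names, and effect sizes are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical benchmark: two systems, each retrained with 5 random seeds
# and evaluated on 5 test splits (one score per seed/split combination).
rng = np.random.default_rng(0)
rows = []
for system, true_mean in [("baseline", 0.80), ("new", 0.82)]:
    for seed in range(5):
        seed_effect = rng.normal(0.0, 0.01)        # seed-level training noise
        for split in range(5):
            rows.append({
                "system": system,
                "seed": f"{system}-{seed}",        # unique seed identifier
                "score": true_mean + seed_effect + rng.normal(0.0, 0.01),
            })
df = pd.DataFrame(rows)

# Variance component analysis: LMEM with a fixed system effect and a
# random intercept per training seed (REML fit for variance estimates).
vc = smf.mixedlm("score ~ system", df, groups=df["seed"]).fit(reml=True)
var_seed = float(vc.cov_re.iloc[0, 0])             # between-seed variance
var_resid = vc.scale                               # residual variance
icc = var_seed / (var_seed + var_resid)            # reliability coefficient
print(f"seed variance={var_seed:.5f}, residual={var_resid:.5f}, ICC={icc:.3f}")

# Likelihood-ratio test of the system effect: nested models must be fit by
# maximum likelihood (reml=False); the statistic is asymptotically
# chi-squared with df equal to the number of extra fixed-effect parameters.
full = smf.mixedlm("score ~ system", df, groups=df["seed"]).fit(reml=False)
reduced = smf.mixedlm("score ~ 1", df, groups=df["seed"]).fit(reml=False)
lr = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr, df=1)
print(f"LR statistic={lr:.2f}, p={p_value:.4f}")
```

Note that the variance components for the ICC are taken from the REML fit, while the likelihood-ratio comparison of fixed effects requires ML fits of the nested models.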
Schedule
Date | Material | Presenter
16.4. | Orga | Riezler
23.4. | Introduction: Hagmann and Riezler, 2023. Towards Inferential Reproducibility of Machine Learning Research. | Riezler (slides)
7.5. | Sources of Nondeterminism: Implementation-Level. [1] Pham et al., 2021. Problems and opportunities in training deep learning software systems: An analysis of variance. Further reading: [2] Zhuang et al., 2022. Randomness in neural network training: Characterizing the impact of tooling. | [1] Asma Motmem
14.5. | Sources of Nondeterminism: Optimizer-Level. [4] Schmidt et al., 2021. Descending through a crowded valley - benchmarking deep learning optimizers. Further reading: [5] Ahn et al., 2022. Reproducibility in optimization: Theoretical framework and limits. | [4] Yu-Chuan Cheng
21.5. | Sources of Nondeterminism: Metaparameter Variation. [6] Melis et al., 2018. On the state of the art of evaluation in neural language models. Further reading: [7] Reimers and Gurevych, 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. | [6] Lisa Jockwitz, [7] Sophia Wikinger
28.5. | Sources of Nondeterminism: Evaluation Metrics. [8] Chen et al., 2022. Reproducibility issues for BERT-based evaluation metrics. Further reading: [9] Post, 2018. A call for clarity in reporting BLEU scores. | [8] Siddhant Tripathi, [9] Bingyu Guo
4.6. | Sources of Nondeterminism: Data Splits. [10] Søgaard et al., 2021. We need to talk about random splits. Further reading: [11] Gorman and Bedrick, 2019. We need to talk about standard splits. | [10] Lydia Körber, [11] Xinyue Cheng
11.6. | Reliability Measures: Bootstrap Confidence Intervals. [12] Agarwal et al., 2021. Deep reinforcement learning at the edge of the statistical precipice. Further reading: [13] Henderson et al., 2018. Deep reinforcement learning that matters. (A minimal bootstrap sketch follows the schedule.) | [12] Marlon Dittes, [13] Hammad Aamer
18.6. | Reliability Measures: Variance Component Analysis and Intra-Class Correlation Coefficient. [14] Chapter 3 of Riezler and Hagmann, 2022. Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science. Further reading: [15] Ferro and Silvello, 2016. A general linear mixed models approach to study system component effects. | [14] Dana Simedrea, [15] Marko Lukosek
25.6. | Significance Testing: Abandon p-values? [16] McShane et al., 2019. Abandon statistical significance. Further reading: [17] Colquhoun, 2017. The reproducibility of research and the misinterpretation of p-values. | [16] Muskan Hashim
2.7. | Significance Testing: Score Distribution Comparison. [18] Dror et al., 2019. Deep dominance - how to properly compare deep neural models. Further reading: [19] Ulmer et al., 2022. deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks. | [18] Yanxin Jia
9.7. | Significance Testing: Bootstrap and Randomization. [20] Clark et al., 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. Further reading: [21] Sellam et al., 2022. The MultiBERTs: BERT reproductions for robustness analysis. (A randomization-test sketch follows the schedule.) | [20] Paul Stefan Saegert
16.7. | Significance Testing: The Generalized Likelihood Ratio Test. [22] Chapter 4 of Riezler and Hagmann, 2022. Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science. Further reading: [23] Robertson and Kanoulas, 2012. On per-topic variance in IR evaluation. | [22] David Schwenke
23.7. | Implementation Project Discussion: Inferential Reproducibility Toolkit | Riezler
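For participants preparing the implementation project, the following is a small self-contained sketch of the two resampling techniques covered in the sessions of 11.6. and 9.7.: a percentile bootstrap confidence interval and a paired approximate randomization test. All data and effect sizes are hypothetical, and the helper functions are illustrative rather than part of any existing toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean score."""
    scores = np.asarray(scores)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def randomization_test(a, b, n_permutations=10_000):
    """Paired approximate randomization test: under H0 the per-instance
    scores of systems a and b are exchangeable, so randomly swapping them
    should produce mean differences as large as the observed one."""
    a, b = np.asarray(a), np.asarray(b)
    observed = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(a)) < 0.5        # swap each pair with p=0.5
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if abs(a_perm.mean() - b_perm.mean()) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)  # smoothed p-value

# Hypothetical per-sentence scores for two systems on the same test set.
base = rng.normal(0.30, 0.05, size=200)
new = base + rng.normal(0.01, 0.03, size=200)
print("95% CI for new system:", bootstrap_ci(new))
print("randomization p-value:", randomization_test(base, new))
```

The pairing in the randomization test (swapping scores within a test instance rather than pooling all scores) is what controls for instance difficulty, in the spirit of the significance tests discussed in the seminar.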