
Problems with Data
Module Description
Course | Module Abbreviation | Credit Points |
---|---|---|
BA-2010[100%|75%] | CS-CL | 6 LP |
BA-2010[50%] | BS-CL | 6 LP |
BA-2010[25%] | BS-AC, BS-FL | 4 LP |
BA-2010 | AS-CL | 8 LP |
Master | SS-CL-TAC | 8 LP |
Lecturer | Katja Markert |
Module Type | |
Language | English |
First Session | 16.04.2025 |
Time and Place | Wed. 10:15-11:45, INF 327 / SR 6; Thu. 10:15-11:45, INF 327 / SR 4 |
Commitment Period | tbd. |
Prerequisite for Participation
Participants
All advanced Bachelor students and all Master students. Students from Computer Science, Mathematics or Scientific Computing with the application area (Anwendungsgebiet) Computational Linguistics are welcome.
Assessment
- Active participation, including leading discussions, contributing to discussions and demonstrating solutions to exercises in class. Attendance is therefore required for exercise and discussion sessions. There is no attendance requirement for lecture sessions.
- 4-5 Exercises
- Presentation
- Written Exam
Active participation and passing the exercises are prerequisites for exam participation. The mark will be a weighted average of the presentation mark and the exam mark. Students who take the module as a Hauptseminar will get a more complex presentation topic and a somewhat harder and/or longer exam (and potentially harder exercise sheets).
Content
In this seminar we will look at various problems that arise with training and test data in NLP. We will look at common pitfalls, why you cannot necessarily believe state-of-the-art results, and how to stress test both your data and your systems. The course covers methods for constructing and investigating data as well as methods for identifying data problems. It will go beyond standard practices that you already know, such as training/test splits and significance tests, and will also tackle problems that are not necessarily statistical.
In particular, we will include or select from the following topics:
- Data Sampling, including methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data.
- Pretraining Data for LLMs: toxicity, data contamination, deduplication, methods to examine pretraining data
- Data Annotation for supervised fine-tuning: annotation methods (expert annotation, crowd-sourcing, LLM annotation etc.), measures for inter-annotator agreement, impact of item order on annotation, impact of annotator bias, learning with annotation disagreement or noise
- Data bias and data artefacts: stress tests, adversarial data, challenge datasets, Clever Hans phenomena, counterfactual datasets
- Synthetic data, and datasets such as OpenThoughts used in reinforcement learning for reasoning
Examples will mainly come from the areas of natural language inference, reasoning and maths, summarization, sentiment and hate speech, and large language modelling.
The course is suitable for advanced Bachelor students (at least after the Orientierungspruefung, preferably from the 3rd semester onwards) and all Master students. It is suitable both for students with primarily linguistic interests and for those interested in ML and algorithms.
The first part of the course will be structured as lectures with exercise classes; the second part will also include student presentations.
Schedule
Date | Session | Materials |
Literature
To be announced in the first week of term.