Problems with Data

Module Description

Course	Module Abbreviation	Credit Points
BA-2010[100%|75%]	CS-CL	6 LP
BA-2010[50%]	BS-CL	6 LP
BA-2010[25%]	BS-AC, BS-FL	4 LP
BA-2010	AS-CL	8 LP
Master	SS-CL-TAC	8 LP

Lecturer	Katja Markert
Module Type	Lecture / Exercise (Vorlesung / Übung)
Language	English
First Session	17.04.2025
Time and Place	Wed. 10:15-11:45, INF 327 / SR 6; Thu. 10:15-11:45, INF 327 / SR 4
Commitment Period	tbd.

Prerequisite for Participation

For MA students: none.
BA students:
- ECL
- Programming I
- Programming II, Experimente Gestalten mit Maschinellem Lernen, or similar additional knowledge of algorithms, programming and machine learning is helpful but not strictly necessary.

Participants

All advanced Bachelor students and all Master students. Students from Computer Science, Mathematics or Scientific Computing with Computational Linguistics as their application area (Anwendungsgebiet) are welcome.

Assessment

Active participation, including leading discussions, contributing to discussions, and demonstrating solutions to exercises in class. There is therefore an attendance requirement for exercise and discussion sessions, but none for lecture sessions.
4-5 Exercises
Presentation
Written Exam

Active participation and passing the exercises are prerequisites for exam participation. The mark will be a weighted average of the presentation mark and the exam mark. Students who take the module as a Hauptseminar will get a more complex presentation topic and a somewhat harder and/or longer exam (and potentially harder exercise sheets).

Content

In this seminar we will look at various problems that arise with training and test data in NLP. We will look at common pitfalls, why you cannot necessarily believe state-of-the-art results, and how to stress test both your data and your systems. The course covers data construction and investigation methods as well as methods for identifying data problems. It will go beyond standard practices that you all know, such as training/test splits and significance tests, and will also tackle problems that are not necessarily statistical.

In particular, we will include or select from the following topics:

Data Sampling, including methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data.
Pretraining data for LLMs: toxicity, data contamination, deduplication, methods to examine pretraining data.
Data annotation for supervised finetuning: annotation methods (expert annotation, crowd-sourcing, LLM annotation etc.), measures for inter-annotator agreement, impact of item order on annotation, impact of annotator bias, learning with annotation disagreement or noise.
Data bias and data artefacts: stress tests, adversarial data, challenge datasets, Clever Hans phenomena, counterfactual datasets.
Synthetic data, and data such as OpenThoughts for reinforcement learning for reasoning.
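
As a small illustration of one topic from the list above, inter-annotator agreement, the following minimal sketch computes Cohen's kappa for two annotators who labelled the same items. The function name `cohens_kappa` and the toy sentiment labels are hypothetical and chosen only for this example; they are not taken from the course materials.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labels independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (made up): two annotators, six items, binary sentiment labels.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on four of six items, yet kappa is much lower than the raw agreement of 0.67, because half of the agreement is already expected by chance given the balanced label distribution.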

Examples will mainly come from natural language inference, reasoning and maths, summarization, sentiment and hate speech, and large language modelling.

The course is suitable for advanced Bachelor students (at least after the Orientierungsprüfung, preferably from the 3rd semester onwards) and all Master students. It suits students with primarily linguistic interests as well as those interested in ML and algorithms.

The first part of the course will be structured as lectures with exercise classes; the second part will also include student presentations.

Schedule

Date	Session	Materials

Literature

To be announced in the first week of term.