Data Extraction, Contamination and Poisoning
Module Description
| Course | Module Abbreviation | Credit Points |
|---|---|---|
| BA-2010[100%\|75%] | CS-CL | 6 LP |
| BA-2010[50%] | BS-CL | 6 LP |
| BA-2010 | AS-CL | 8 LP |
| Master | SS-CL-TAC | 8 LP |
Lecturer | Katja Markert |
Module Type | Proseminar / Hauptseminar |
Language | English |
First Session | 24.10.2024 |
Time and Place | Thu, 10:15-11:45, INF 329 / SR 26 |
Commitment Period | tbd. |
Participants
Suitable for all advanced Bachelor students with 50% or more CL and for all CL Master students. Students from Data and Computer Science, Mathematics, or Scientific Computing with Computational Linguistics as their application area (Anwendungsgebiet) are welcome.
Prerequisites for Participation
For Bachelor students:
- ECL, Prog I and Prog II
- Statistical Methods and/or Neural Networks
For Master students: none, but the seminar is an ideal continuation for students who took "Problems with Data".
Assessment
- Presentation
- Discussion contributions: literature critiques, co-assessment of presentations, exercises
- For PS: Second presentation or small implementation project
- For HS: Implementation project (can be conducted in a team)
Content
Given that the pretraining data of LLMs is often unknown or only partially known (even for so-called open-source models, which are frequently only open-weight), several questions arise that we tackle in this seminar:
- Data Extraction: Can we extract parts of the training data from black-box LLMs without having access to the training corpora? What implications does this have for privacy if the training data contains de-anonymised data? (A minimal memorization-probing sketch follows this list.)
- Data Contamination: How can we still be sure that our test data (be it linguistically annotated corpora, mathematical tests, coding questions or general knowledge questions) is not already part of the (pre-)training data? In particular, we will look at how companies address data contamination in their model cards, and at how users can test whether a particular text or benchmark was included in the training data, for example via membership inference attacks (see the second sketch below). In addition, we will discuss how to measure the impact of data contamination on LLM generalization performance.
- Data Poisoning: Given that a large portion of pretraining data is web-scraped, how likely is it that the data can be "poisoned" by malicious actors?
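To give a flavour of the first topic, here is a minimal sketch of memorization probing: prompt a model with the prefix of a candidate document and check whether greedy decoding reproduces the true continuation verbatim. This is not from the seminar materials; the model name (gpt2), token lengths, and function names are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM available on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def reproduces_continuation(text: str, prefix_tokens: int = 50,
                            continuation_tokens: int = 50) -> bool:
    """Return True if greedy decoding from the document's prefix reproduces
    its actual continuation token-for-token (a sign of verbatim memorization)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.size(0) < prefix_tokens + continuation_tokens:
        return False  # document too short for this probe
    prefix = ids[:prefix_tokens].unsqueeze(0)
    target = ids[prefix_tokens:prefix_tokens + continuation_tokens]
    with torch.no_grad():
        generated = model.generate(prefix,
                                   max_new_tokens=continuation_tokens,
                                   do_sample=False)  # greedy decoding
    continuation = generated[0, prefix_tokens:prefix_tokens + continuation_tokens]
    return bool(torch.equal(continuation, target))
```

A positive result only shows that the model can regurgitate the passage; the seminar readings discuss how to turn such probes into systematic extraction attacks.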
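For the second topic, the following sketch illustrates the idea behind a simple loss-based membership-inference check: text the model has seen during training tends to receive lower per-token loss (lower perplexity) than comparable unseen text. The model name, example strings, and the 0.8 threshold are assumptions for illustration, not a reference implementation of any particular attack.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: replace with the model under investigation
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

# Illustrative decision rule: flag a benchmark item as possibly contaminated
# if the model finds it much more predictable than a fresh paraphrase.
candidate = "Question: Who wrote 'Faust'? Answer: Johann Wolfgang von Goethe."
paraphrase = "Name the author of the play 'Faust'. It was Johann Wolfgang von Goethe."
if perplexity(candidate) < 0.8 * perplexity(paraphrase):  # threshold is an assumption
    print("Candidate is suspiciously familiar to the model.")
```

Real membership-inference attacks calibrate such scores against reference models or score distributions rather than a fixed threshold; the seminar literature covers these refinements.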
Agenda
| Date | Session | Materials |
|---|---|---|