Optimizing Data Usage in Neural Sequence-To-Sequence Learning
Module Description
Course | Module Abbreviation | Credit Points |
---|---|---|
BA-2010[100%\|75%] | CS-CL | 6 LP |
BA-2010[50%] | BS-CL | 6 LP |
BA-2010[25%] | BS-AC | 4 LP |
BA-2010 | AS-CL | 8 LP |
Master | SS-CL, SS-TAC | 8 LP |
Lecturer | Tsz Kin Lam |
Module Type |
Language | English |
First Session | 28.10.2021 |
Time and Place | Thursday, 11:15-12:45, |
Commitment Period | tba |
Prerequisite for Participation
Assessment (Updated - see materials in the first week)
Content
Deep learning is the de facto standard for many classification tasks, e.g., in natural language processing or image recognition. However, it is also notorious for being data-hungry. This data-hungry nature, together with the costly annotation process, has stimulated a lot of research on creating synthetic data, also known as data augmentation. Another popular method is to obtain additional data by crawling the web, also known as data crawling. Both approaches allow substantial increases in training data at little cost. However, synthetic or crawled data can be noisy, e.g., due to misalignments between source and target sentences, or due to a domain mismatch between the new data and the original training data. This casts doubt on how much such additional data benefits the final model, and it is where data selection comes into play.
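As one concrete illustration of what data selection can mean in practice, the sketch below implements a classic criterion, the cross-entropy difference of Moore & Lewis (2010): sentences that look likely under an in-domain language model but unlikely under a general-domain one are kept. The two language-model scorers here are hypothetical placeholders, not part of the seminar materials; any model exposing a per-token cross-entropy would do.

```python
# Minimal sketch of cross-entropy-difference data selection
# (Moore & Lewis, 2010). The scorers are hypothetical placeholders.

from typing import Callable, List, Tuple

def cross_entropy_difference(
    sentence: str,
    in_domain_ce: Callable[[str], float],  # per-token CE under an in-domain LM
    general_ce: Callable[[str], float],    # per-token CE under a general-domain LM
) -> float:
    """Lower score = looks in-domain rather than merely generic."""
    return in_domain_ce(sentence) - general_ce(sentence)

def select_top_k(
    corpus: List[str],
    in_domain_ce: Callable[[str], float],
    general_ce: Callable[[str], float],
    k: int,
) -> List[str]:
    """Rank crawled or synthetic sentences by cross-entropy difference
    and keep the k most in-domain-looking ones."""
    scored: List[Tuple[float, str]] = sorted(
        (cross_entropy_difference(s, in_domain_ce, general_ce), s)
        for s in corpus
    )
    return [s for _, s in scored[:k]]
```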
The focus of this seminar is on optimizing data usage in neural sequence-to-sequence learning on text data. Participants will learn about recent advances in data selection and data augmentation, and their connections to multi-domain scenarios. The application focus will be sequence-to-sequence learning, especially machine translation. Topics will include (but are not limited to):