Optimizing Data Usage in Neural Sequence-To-Sequence Learning
Module Description
Course | Module Abbreviation | Credit Points |
---|---|---|
BA-2010[100%\|75%] | CS-CL | 6 LP |
BA-2010[50%] | BS-CL | 6 LP |
BA-2010[25%] | BS-AC | 4 LP |
BA-2010 | AS-CL | 8 LP |
Master | SS-CL, SS-TAC | 8 LP |
Lecturer | Tsz Kin Lam |
Module Type |
Language | English |
First Session | 28.10.2021 |
Time and Place | Thursday, 11:15-12:45, |
Commitment Period | tba |
Prerequisite for Participation
Assessment (Updated - see materials in the first week)
Content
Deep learning is the de facto standard for many classification tasks, e.g., in natural language processing or image recognition. However, it is also notorious for being data-hungry. This data-hungry nature, together with the costly annotation process, has stimulated a lot of research on creating synthetic data, also known as data augmentation. Another popular method is to obtain additional data by crawling the web, also known as data crawling. Both approaches allow substantial increases in training data at little cost. However, synthetic or crawled data can be noisy, e.g., due to misalignments between source and target sentences, or due to a domain mismatch between the new data and the original training data. This casts doubt on how much such additional data benefits the final model, and it is where data selection comes into play.
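As one concrete illustration of what data selection can mean in practice, the sketch below implements a classic criterion, the cross-entropy difference of Moore & Lewis (2010): sentences that look likely under an in-domain language model but unlikely under a general-domain one are kept. The two language-model scorers here are hypothetical placeholders, not part of the seminar materials; any model exposing a per-token cross-entropy would do.

```python
# Minimal sketch of cross-entropy-difference data selection
# (Moore & Lewis, 2010). The scorers are hypothetical placeholders.

from typing import Callable, List, Tuple

def cross_entropy_difference(
    sentence: str,
    in_domain_ce: Callable[[str], float],  # per-token CE under an in-domain LM
    general_ce: Callable[[str], float],    # per-token CE under a general-domain LM
) -> float:
    """Lower score = looks in-domain rather than merely generic."""
    return in_domain_ce(sentence) - general_ce(sentence)

def select_top_k(
    corpus: List[str],
    in_domain_ce: Callable[[str], float],
    general_ce: Callable[[str], float],
    k: int,
) -> List[str]:
    """Rank crawled or synthetic sentences by cross-entropy difference
    and keep the k most in-domain-looking ones."""
    scored: List[Tuple[float, str]] = sorted(
        (cross_entropy_difference(s, in_domain_ce, general_ce), s)
        for s in corpus
    )
    return [s for _, s in scored[:k]]
```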
The focus of this seminar is on optimizing data usage in neural sequence-to-sequence learning on text data. Participants will learn about recent advances in data selection and data augmentation, and their connections to multi-domain scenarios. The application focus will be sequence-to-sequence learning, especially machine translation. Topics will include (but are not limited to):