Empirical Methods for NLP and Data Science
Module Description
Course | Module Abbreviation | Credit Points |
---|---|---|
BA-2010 | AS-CL | 8 LP |
Master | SS-CL, SS-TAC | 8 LP |
Informatik Seminar | IS | 4 LP |
Lecturer | Stefan Riezler |
Module Type | |
Language | English |
First Session | 22.04.2021 |
Time and Place | Thursday, 11:15-12:45, tba |
Commitment Period | tbd. |
Prerequisite for Participation
Good knowledge of statistical machine learning (e.g., through successful completion of the courses "Statistical Methods for Computational Linguistics" and/or "Neural Networks: Architectures and Applications for NLP") and experience in experimental work (e.g., a software project or seminar implementation project).
Assessment
Content
Most natural language processing (NLP) or data science tasks can be formalized as machine learning problems in which a prediction function is learned and evaluated on data pairs of inputs and gold standard outputs. Usually, the representation of the input data and the association of inputs with gold standard outputs are not questioned, which assumes an ideal machine learning scenario. In real-world NLP problems, machine learning is preceded by a step of establishing representations of input data and annotating inputs with gold standard labels, and it is followed by a comparative evaluation of the performance of the machine learning model on a held-out set of annotated data. Correct methodology in these phases is essential for the overall success of empirical NLP; however, it is underrepresented in theory and often neglected in practice. In this seminar we will explicitly discuss questions and methods of empirical science that concern the phases preceding and following machine learning, centered around the problems of validity, reliability, and significance.
The problem of VALIDITY includes the following questions:
Another set of questions concerns the RELIABILITY of human data annotations and of the predictions of machine learning models, including a discussion of the following methods:
A last set of problems concerns SIGNIFICANCE, i.e., the question of how theory-critical hypotheses can be tested and confirmed as accurately as possible. Our discussion will include the following items:
Literature
The seminar will be based on a pre-print of the textbook "Validity, Reliability, and Significance: Model-Based Empirical Methods for NLP" by Stefan Riezler and Michael Hagmann (in progress).
A list of further literature will be given in the first session of the seminar.
Enrollment
Please enroll at the CL enrollment page by April 11, 2021.