Books & Tutorials
Validity, Reliability, and Significance
Monograph on Empirical Methods for NLP and Data Science
Tutorial
Tutorial on Statistical Methods for Reproducible Machine Learning
Software
Open-Source Software Projects hosted by our group:
Contrastive Markings
Code for experiments combining postedits and online markings, from the EAMT 2023 paper, Enhancing Supervised Learning...
Joey NMT
Minimalist NMT for educational purposes.
sparse_szo
Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order Optimization.
QUETCH
Quality estimation for machine translation.
cclir
A cross-language information retrieval (CLIR) toolbox based on the cdec decoder, code package used in Bag-of-words Fo...
rebol
A toolkit for grounded learning for statistical machine translation, as described in the ACL 2014 paper, Response-Bas...
dtrain
A tuning method implemented for the cdec decoder; see Joint Feature Selection in Distributed Stochastic Learning for ...
otedama
Preordering for Machine Translation.
semparse
A semantic parser that treats the task as a monolingual SMT problem. The underlying SMT framework is the cdec decoder.
Contributions by our Group to other Open-Source Software Projects:
nematus
A toolkit for neural machine translation.
Neural Monkey
An open-source tool for sequence learning in NLP.
Corpora
BoostCLIR
A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance j...
DeCOCO
German translations for 1000 image captions from the COCO dataset.
HumanMT
Human ratings and corrections for translations from German to English and vice versa.
LibriVoxDeEn
A corpus for German-to-English Speech Translation and Speech Recognition.
map2seq
A dataset consisting of 7,672 Natural Language Landmark Navigation Instructions and corresponding route paths in Open...
MetaCLIR
Meta-textual information for BoostCLIR and the Large Scale CLIR Dataset (wiki-clir).
NFCorpus
A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.
NLmaps
A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Reada...
PatTR
A parallel patent corpus for statistical machine translation featuring three language pairs, German-English (23M sent...
SepsisExp
A dataset consisting of timelines of patient health data with sepsis labels assigned by senior physicians.
WikiCaps
A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimed...
WikiCLIR
A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.