WikiCaps: A Multilingual Dataset of User-generated Captions
WikiCaps is a large-scale multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the large body of non-descriptive image-caption pairs available on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and, for each of German, French, and Russian, additional dev and test sets of about 1,000 image-caption pairs each, together with their English counterparts.
Terms of Use
The textual part of WikiCaps is licensed under a Creative Commons BY-SA 4.0 Unported License.
The visual part of WikiCaps is protected under different licenses by the original authors. We thus included a script for downloading the images directly from Wikimedia Commons.
If you use the corpus in your work, please cite: (Schamoni et al., 2018)
Data
The corpus contains image-caption pairs for the English retrieval part, and image-caption pairs for dev and test with parallel captions in German, French, and Russian and their English counterparts. The size of each part is listed in the table below.
The image-caption data was retrieved from Wikimedia Commons. For space and processing efficiency, images were resized to a minimum of 256 pixels (width or height) preserving the original aspect ratio.
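As a rough illustration of the resizing step, the sketch below computes target dimensions under one possible reading of the sentence above: that the shorter side is scaled to 256 pixels and the longer side follows from the aspect ratio. This interpretation, the function name `resized_dimensions`, and the rounding behaviour are assumptions, not the procedure documented for the corpus.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int, target: int = 256) -> Tuple[int, int]:
    """Scale (width, height) so that the shorter side equals `target`.

    Assumption: "a minimum of 256 pixels" is read here as "shorter side = 256".
    The longer side is rounded to the nearest integer.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width <= height:
        # Width is the shorter side: pin it to the target.
        return target, round(height * target / width)
    # Height is the shorter side: pin it to the target.
    return round(width * target / height), target


landscape = resized_dimensions(1024, 768)
portrait = resized_dimensions(500, 1000)
```

Whether images smaller than 256 pixels were upscaled is not stated here; the README in the download archive is the place to check.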
For a more detailed description of the corpus construction process, see the above publication and consult the README in the download archive.
split | #images | #captions | language(s) |
---|---|---|---|
retrieval | 3,816,940 | 3,825,132 | English |
dev | 1,000 | 1,000 | German–English |
test | 999 | 999 | German–English |
dev | 999 | 999 | French–English |
test | 1,000 | 1,000 | French–English |
dev | 1,000 | 1,000 | Russian–English |
test | 1,000 | 1,000 | Russian–English |
Format
There are three types of data files:
- Monolingual retrieval data (img_en)
- Bilingual dev and test data (.dev or .test file)
- Image list (.lst file)
The format of the img_en file for retrieval is:
image-filename [TAB] English-caption
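A minimal sketch of reading this format in Python is shown here. The function name `parse_retrieval_lines` and the sample line are hypothetical, and the code assumes UTF-8 text with exactly one TAB separating filename and caption.

```python
from typing import Iterable, Iterator, Tuple


def parse_retrieval_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image_filename, english_caption) pairs from img_en-style lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first TAB only, in case a caption contains further tabs.
        filename, caption = line.split("\t", 1)
        yield filename, caption


# Hypothetical sample line, not taken from the corpus.
sample = ["Example_image.jpg\tA church in winter\n"]
pairs = list(parse_retrieval_lines(sample))
```

For the real file, the same function can be fed an open file handle, for example `open("img_en", encoding="utf-8")`.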
The format of the bilingual .dev and .test files is:
image-filename [TAB] Foreign-caption ||| English-caption
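A bilingual line can be split in two steps: first at the TAB, then at the `|||` separator. The sketch below assumes that `|||` does not occur inside a caption; the names `BilingualEntry` and `parse_bilingual_line` and the sample line are hypothetical.

```python
from typing import NamedTuple


class BilingualEntry(NamedTuple):
    image: str
    foreign: str
    english: str


def parse_bilingual_line(line: str) -> BilingualEntry:
    """Parse 'image-filename [TAB] Foreign-caption ||| English-caption'."""
    image, captions = line.rstrip("\r\n").split("\t", 1)
    # partition() splits at the first occurrence of the separator.
    foreign, sep, english = captions.partition("|||")
    if not sep:
        raise ValueError("missing '|||' separator")
    # Strip the spaces that surround the separator.
    return BilingualEntry(image, foreign.strip(), english.strip())


# Hypothetical sample line, not taken from the corpus.
entry = parse_bilingual_line(
    "Beispiel.jpg\tEine Kirche im Winter ||| A church in winter\n"
)
```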
The image list files (.lst) contain one image filename per line and serve as input for wikimgrab.pl (see download archive):
image-filename
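If a custom image list is needed, for instance for a subset of the data, one could derive it from the tab-separated data files. The sketch below collects unique filenames in order of first appearance, since the retrieval part has more captions than images and a filename may therefore repeat. The function name `image_list_from_data` and the sample lines are hypothetical; how wikimgrab.pl consumes the list is described in the download archive.

```python
from typing import Iterable, List


def image_list_from_data(lines: Iterable[str]) -> List[str]:
    """Return unique image filenames from tab-separated data lines, in order."""
    seen = set()
    ordered = []
    for line in lines:
        # The filename is everything before the first TAB.
        filename = line.rstrip("\r\n").split("\t", 1)[0]
        if filename and filename not in seen:
            seen.add(filename)
            ordered.append(filename)
    return ordered


# Hypothetical sample lines; "a.jpg" appears twice with different captions.
data = [
    "a.jpg\tfirst caption\n",
    "b.jpg\tanother caption\n",
    "a.jpg\tsecond caption\n",
]
# One filename per line, matching the .lst layout described above.
lst_text = "\n".join(image_list_from_data(data)) + "\n"
```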
Download
wikicaps_v1.0.tar.gz (v1.0, 02/13/2018, 427MB, md5: 47a3aa5cf64f70aced556f1751faedba)
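After downloading, the archive can be checked against the md5 sum given above. This is a small sketch using Python's standard `hashlib`; the function names and the chunk size are illustrative choices.

```python
import hashlib

# md5 sum as published for wikicaps_v1.0.tar.gz
EXPECTED_MD5 = "47a3aa5cf64f70aced556f1751faedba"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's md5 digest with the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("wikicaps_v1.0.tar.gz")` should return `True` for an undamaged download, assuming the file is in the current directory.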
Publication
- Shigehiko Schamoni, Julian Hitschler, and Stefan Riezler. A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions. Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas (AMTA), Boston, MA, USA, 2018.
@inproceedings{schamoni2018,
  author = {Schamoni, Shigehiko and Hitschler, Julian and Riezler, Stefan},
  title = {A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions},
  journal = {Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas},
  journal-abbrev = {AMTA},
  year = {2018},
  city = {Boston, MA},
  country = {USA},
  url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/AMTA2018.1.pdf}
}