MetaCLIR: Meta-Textual Information for Cross-lingual Information Retrieval
This data adds textual meta-information to two existing corpora for cross-language information retrieval: BoostCLIR and the Large-Scale CLIR Dataset (wiki-clir).
Terms of Use
This data is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite (Kuwa et al., 2020) in addition to BoostCLIR and/or wiki-clir if you use this data extension in your work.
Data
This collection extends two existing corpora for cross-language information retrieval by providing additional textual meta-information. For BoostCLIR, the extension adds IPC codes for each US and Japanese patent in the corpus. For wiki-clir, the extension adds category information to each article in English, French, German, and Japanese.
Format
- The format of the additional meta-information files (*.categories) is:
document-id [TAB] category1 [SPACE] category2 [SPACE] ... [SPACE] categoryN
- For information about the format of the other data, please refer to the descriptions of BoostCLIR and wiki-clir.
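The line format described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function names `parse_categories_line` and `load_categories` and the sample IDs and category strings are hypothetical, and it assumes that categories contain no plain space characters, as the format implies.

```python
from typing import Dict, Iterable, List, Tuple


def parse_categories_line(line: str) -> Tuple[str, List[str]]:
    """Split one line into (document_id, [category, ...])."""
    doc_id, _, rest = line.rstrip("\r\n").partition("\t")
    # Split on the plain space only, so any other whitespace inside a
    # category name is left untouched; drop empty fields.
    categories = [c for c in rest.split(" ") if c]
    return doc_id, categories


def load_categories(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a document-id -> categories mapping, skipping blank lines."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        if line.strip():
            doc_id, categories = parse_categories_line(line)
            mapping[doc_id] = categories
    return mapping


# Hypothetical sample lines; real IDs and category strings will differ.
sample = ["doc-1\tA61K C07D\n", "doc-2\tG06F\n"]
print(load_categories(sample))
```

To read an actual *.categories file, pass an open file handle to `load_categories`, for example `open(path, encoding="utf-8")`; the UTF-8 encoding is an assumption and is not stated in this description.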
Download
- BoostCLIR with IPC-codes: boostclir-ipccodes.tar.gz (270MB, md5: deb9f1cb53f6347dd98548e1b7e512b6)
You can download the MAREC data set, which contains the source documents, from TU Wien, and order the NTCIR collections, which contain the queries, from the organizers of the NTCIR PatentMT task.
- wiki-clir categories: wiki-clir-categories.tar.gz (495MB, md5: a4613b0193412a32a6a426c884028960)
You additionally need to download the Large-Scale CLIR Dataset wiki-clir corpus to extract the query and document texts in English, French, German, and Japanese.
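After downloading, the archives can be checked against the md5 values listed above. The following is a small sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path in the commented example is assumed to be wherever you saved the archive.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's md5 digest with an expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes the archive sits in the current directory):
# matches_checksum("boostclir-ipccodes.tar.gz",
#                  "deb9f1cb53f6347dd98548e1b7e512b6")
```

A result of `False` usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.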
Publication
- Embedding Meta-Textual Information for Improved Learning to Rank. Proceedings of the 28th International Conference on Computational Linguistics (COLING), Barcelona, Spain, 2020
@article{kuwa2020,
  author = {Kuwa, Toshitaka and Schamoni, Shigehiko and Riezler, Stefan},
  year = {2020},
  title = {Embedding Meta-Textual Information for Improved Learning to Rank},
  journal = {Proceedings of the 28th International Conference on Computational Linguistics},
  journal-abbrev = {COLING},
  city = {Barcelona, Spain},
  url = {http://arxiv.org/abs/2010.16313}
}