MetaCLIR: Meta-Textual Information for Cross-lingual Information Retrieval

This data set adds textual meta-information to two existing corpora for cross-lingual information retrieval (CLIR): BoostCLIR and the Large Scale CLIR Dataset (wiki-clir).

Terms of Use

This data is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite (Kuwa et al., 2020) in addition to BoostCLIR and/or wiki-clir if you use this data extension in your work.

Data

This collection extends two existing corpora for cross-lingual information retrieval by providing additional textual meta-information. For BoostCLIR, the extension adds IPC (International Patent Classification) codes for each US and Japanese patent in the corpus. For wiki-clir, the extension adds category information for each article in English, French, German, and Japanese.

Format

  • The format of the additional meta-information files (*.categories) is:
    document-id [TAB] category1 [SPACE] category2 [SPACE] ... [SPACE] categoryN
    
  • For information about the format of the other data, please refer to the descriptions of BoostCLIR and wiki-clir.
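
The *.categories format described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the distributed data or tools: the function names parse_categories and load_categories are hypothetical, UTF-8 encoding is an assumption, and the sample lines are invented for illustration. It also assumes that individual categories contain no whitespace, as the space-separated format implies.

```python
from typing import Dict, Iterable, List


def parse_categories(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each document id to its list of categories.

    Expects lines of the form:
        document-id<TAB>category1 category2 ... categoryN
    """
    doc_categories: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first TAB is the document id.
        doc_id, _, categories = line.partition("\t")
        # Categories are separated by whitespace; a missing field gives [].
        doc_categories[doc_id] = categories.split()
    return doc_categories


def load_categories(path: str) -> Dict[str, List[str]]:
    """Read a *.categories file (UTF-8 is assumed here)."""
    with open(path, encoding="utf-8") as handle:
        return parse_categories(handle)


# Invented sample lines, not taken from the actual data.
sample = ["doc-1\tG06F H04L\n", "doc-2\tA61K\n"]
print(parse_categories(sample))
# prints {'doc-1': ['G06F', 'H04L'], 'doc-2': ['A61K']}
```

The resulting dictionary can then be joined on the document id with the documents from BoostCLIR or wiki-clir.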

Download

You can download the MAREC data set, which contains the source documents, from TU Wien. The NTCIR collections, which contain the queries, can be ordered from the organizers of the NTCIR PatentMT task.

You additionally need to download the Large Scale CLIR Dataset (wiki-clir) to extract the query and document texts in English, French, German, and Japanese.

Publication

  1. Toshitaka Kuwa, Shigehiko Schamoni and Stefan Riezler
    Embedding Meta-Textual Information for Improved Learning to Rank
    Proceedings of the 28th International Conference on Computational Linguistics (COLING), Barcelona, Spain, 2020
    @inproceedings{kuwa2020,
      author = {Kuwa, Toshitaka and Schamoni, Shigehiko and Riezler, Stefan},
      year = {2020},
      title = {Embedding Meta-Textual Information for Improved Learning to Rank},
      booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
      series = {COLING},
      address = {Barcelona, Spain},
      url = {http://arxiv.org/abs/2010.16313}
    }