PatTR: Patent Translation Resource
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.
Terms of Use
PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. If you use the corpus in your work, please cite (Wäschle & Riezler, 2012) or use the data citation specified in the HeiDATA entry.
Data
The corpus is organized by language pair and by the text sections of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.
Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007).
All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.
For a detailed description of the corpus construction process, please see the publications.
Metadata
In addition to the bitext we provide patent metadata for each sentence:
- The patent id of the original document, which can be a patent application or a granted patent.
- The patent family id that groups related documents with mostly overlapping content, e.g. patents for the same invention in different jurisdictions.
- Publication date.
- Classification according to the IPC down to subclass level.
Further metadata, e.g. inventor or company, can be found in the original patent indicated by the document id.
For description data, where the bitext has been collected from two separate documents, metadata is given for both original patents.
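Since the classification is given down to subclass level, an IPC code can be broken into its section, class and subclass, for example to group sentences by IPC section. The following is a minimal sketch. It assumes the code is stored as a four-character string such as `A61K`; the exact formatting in the PatTR metadata files may differ, and `split_ipc_subclass` is a hypothetical helper, not part of the corpus tooling.

```python
def split_ipc_subclass(code):
    """Split an IPC subclass code such as 'A61K' into its hierarchy levels.

    Assumes the four-character form: section letter (A-H), two-digit class,
    subclass letter. The format in the PatTR metadata is an assumption.
    """
    code = code.strip().upper()
    valid = (
        len(code) == 4
        and code[0] in "ABCDEFGH"   # the IPC has eight sections, A through H
        and code[1:3].isdigit()     # two-digit class number
        and code[3].isalpha()       # subclass letter
    )
    if not valid:
        raise ValueError(f"not an IPC subclass code: {code!r}")
    return {"section": code[0], "class": code[:3], "subclass": code}


levels = split_ipc_subclass("A61K")
print(levels["section"], levels["class"], levels["subclass"])
```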
Download
Parallel data:
- de-en.tar.gz (2.8GB, md5: 55a074640806d29c9dcfcdb9346e6ce7)
- en-fr.tar.gz (2.4GB, md5: 68cb277faf451b0206eb85c559d29c46)
- fr-de.tar.gz (646MB, md5: 120484093f5f930fe8646eb3b3be76e3)
You can download the MAREC data set, which contains the source documents, from TU Wien.
PatTR is also available from HeiDATA.
Training and test sets for several tasks are available:
- Multi-task learning on text genres and IPC section (Wäschle & Riezler, 2012): eacl12.tar.gz (1.7GB, md5: f7afcd4cb5189cd8bfc33c95af556215)
- Multi-task learning on IPC sections (Simianer & Riezler, 2013): wmt13.tar.gz (204MB, md5: 20a51980a77af40df30e283a3c33b77e)
- Online learning for computer-assisted translation (Wäschle et al., 2013): mtsummit13.tar.gz (596MB, md5: c4b22fcec89aa9e13e26aa5f1db767f9)
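The MD5 checksums listed with the archives can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib` module. The checksums are copied from the lists above; the helper names `md5_of_file` and `verify_downloads` and the assumption that the archives sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the download lists above.
EXPECTED_MD5 = {
    "de-en.tar.gz": "55a074640806d29c9dcfcdb9346e6ce7",
    "en-fr.tar.gz": "68cb277faf451b0206eb85c559d29c46",
    "fr-de.tar.gz": "120484093f5f930fe8646eb3b3be76e3",
    "eacl12.tar.gz": "f7afcd4cb5189cd8bfc33c95af556215",
    "wmt13.tar.gz": "20a51980a77af40df30e283a3c33b77e",
    "mtsummit13.tar.gz": "c4b22fcec89aa9e13e26aa5f1db767f9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive found in `directory` to True if its checksum matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(name, "OK" if ok else "CHECKSUM MISMATCH")
```

Archives that are not present in the directory are skipped, so the check can be run after downloading only some of the files.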
Splitting the data
To create custom training and test sets, an easy option is to split the corpus by document publication date. Note that abstract and claims data contain a small amount (less than 1%) of duplicate and near-duplicate sentences due to multiple instances of the same patent document in the two corpora. To prevent overlap, make sure that the family ids of the test and training sets are disjoint. Furthermore, about 7% of the description data are duplicates. This is caused by the patent writing process, where whole paragraphs are copied verbatim from other documents, e.g. when parts of an invention are similar to a previously filed one. These documents do not share a patent id, so they cannot be easily identified. Indicators are mutual citations and documents filed by the same company. We did not remove these duplicates because they are a feature of patent corpora. Since patent titles are very short and general, 15% of title data are natural duplicates.
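A date-based split with disjoint family ids can be sketched as follows. This is a minimal illustration, not the procedure used for the released task data. It assumes each sentence pair has already been loaded into a dict with a single `family_id` and a publication `date`; these field names, the in-memory layout and the ids in the toy records are made up for the example.

```python
from datetime import date


def split_by_date(records, cutoff):
    """Split records into (train, test) by publication date.

    Records published before `cutoff` go to training, the rest to test.
    Training records whose family id also occurs in the test set are
    dropped so that the two sets share no patent family.
    """
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    test_families = {r["family_id"] for r in test}
    train = [r for r in train if r["family_id"] not in test_families]
    return train, test


# Toy records with made-up ids, for illustration only.
records = [
    {"patent_id": "P1", "family_id": "F1", "date": date(2005, 3, 1)},
    {"patent_id": "P2", "family_id": "F2", "date": date(2006, 7, 15)},
    {"patent_id": "P3", "family_id": "F2", "date": date(2009, 1, 20)},
    {"patent_id": "P4", "family_id": "F3", "date": date(2010, 5, 5)},
]

train, test = split_by_date(records, date(2008, 1, 1))
print([r["patent_id"] for r in train])
print([r["patent_id"] for r in test])
```

Filtering on family ids removes overlap from multiple instances of the same patent, but it does not catch the verbatim copies between unrelated description documents mentioned above, since those do not share an id.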
Statistics
Section | Sentences | en tokens | de tokens | Bitext size |
---|---|---|---|---|
title | 2,101,107 | 16,457,527 | 13,212,645 | 248MB |
abstract | 720,571 | 30,942,571 | 26,803,868 | 383MB |
claims | 8,346,863 | 501,373,533 | 435,117,827 | 6.1GB |
description | 11,829,816 | 498,948,414 | 386,920,744 | 4.9GB |
total | 22,998,357 | 1,047,722,045 | 862,055,084 | 11.5GB |
Section | Sentences | en tokens | fr tokens | Bitext size |
---|---|---|---|---|
title | 2,504,772 | 19,458,540 | 23,605,412 | 307MB |
abstract | 3,697,670 | 130,801,982 | 144,591,792 | 1.73GB |
claims | 6,966,851 | 422,504,392 | 468,029,948 | 5.3GB |
description | 5,594,745 | 200,043,688 | 204,449,266 | 2.5GB |
total | 18,764,038 | 772,808,602 | 840,676,418 | 9.84GB |
Section | Sentences | fr tokens | de tokens | Bitext size |
---|---|---|---|---|
title | 1,953,815 | 18,337,771 | 12,229,339 | 252MB |
abstract | 122,440 | 5,816,764 | 4,594,012 | 74MB |
claims | 3,034,007 | 206,982,238 | 162,760,901 | 2.5GB |
total | 5,110,262 | 231,136,773 | 179,584,252 | 2.83GB |
The numbers for de-en differ slightly from those reported in (Wäschle & Riezler, 2012) due to some additional processing steps that were performed before the release.
Acknowledgments
This work was supported in part by the “Cross-language Learning-to-Rank for Patent Retrieval” project funded by the Deutsche Forschungsgemeinschaft (DFG).
Publications
- Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Proceedings of the 5th Information Retrieval Facility Conference (IRFC), Vienna, Austria, 2012
@inproceedings{waeschle2012a, author = {W\"{a}schle, Katharina and Riezler, Stefan}, title = {Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus}, journal = {Proceedings of the 5th Information Retrieval Facility Conference}, journal-abbrev = {IRFC}, year = {2012}, city = {Vienna}, country = {Austria}, url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/IRF2012.pdf} }
- Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Avignon, France, 2012
@inproceedings{waeschle2012b, author = {W\"{a}schle, Katharina and Riezler, Stefan}, title = {Structural and Topical Dimensions in Multi-Task Patent Translation}, journal = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics}, journal-abbrev = {EACL}, year = {2012}, city = {Avignon}, country = {France}, url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/EACL2012.pdf} }
- Multi-Task Learning for Improved Discriminative Training in SMT. Proceedings of the Workshop on Statistical Machine Translation (WMT), Sofia, Bulgaria, 2013
@inproceedings{simianer2013b, author = {Simianer, Patrick and Riezler, Stefan}, title = {Multi-Task Learning for Improved Discriminative Training in SMT}, journal = {Proceedings of the Workshop on Statistical Machine Translation}, journal-abbrev = {WMT}, year = {2013}, city = {Sofia}, country = {Bulgaria}, url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/WMT2013.pdf} }
- Generative and Discriminative Methods for Online Adaptation in SMT. Proceedings of MT SUMMIT XIV, Nice, France, 2013
@inproceedings{waeschle2013, author = {W\"{a}schle, Katharina and Simianer, Patrick and Bertoldi, Nicola and Riezler, Stefan and Federico, Marcello}, title = {Generative and Discriminative Methods for Online Adaptation in SMT}, journal = {Proceedings of MT SUMMIT XIV}, year = {2013}, city = {Nice}, country = {France}, url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/MTSUMMIT13.pdf} }