Weakly Supervised Learning of Cross-Lingual Systems
Summary: Cross-lingual rankings for information retrieval can be learned directly from data that are not strictly parallel but are weakly supervised by relevance indicators such as citations in patents or hyperlinks in Wikipedia pages. We intend to turn this idea on its head by applying the techniques that have proven successful for learning-to-rank in cross-lingual retrieval to discriminative training of machine translation on massive non-parallel data, and, in the process, to further improve methods for cross-lingual retrieval. The key ingredients of our proposed techniques will be the combination of learning from weakly supervised data with methods that best exploit the weak supervision signals, namely fine-grained sparse features and learning from both positive and negative examples. We motivate our research by an application to translation and cross-lingual retrieval in the medical domain, where massive amounts of quasi-parallel training data are available on the Internet, in research publications, and in patent data.
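
To make the key ingredients concrete, the following is a minimal sketch of pairwise learning-to-rank from weak supervision with fine-grained sparse features. All names and the toy data are illustrative assumptions, not part of the proposed system: the weak signal is taken to be a link (e.g., a patent citation or Wikipedia hyperlink) from a source-language query to a target-language document, a positive/negative pair is a linked versus an unlinked document, and the sparse features are simple cross-lingual word-pair counts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def sparse_features(query_tokens: List[str], doc_tokens: List[str]) -> Dict[str, float]:
    """Fine-grained sparse features: one weight per (source word, target word) pair."""
    feats: Dict[str, float] = defaultdict(float)
    for q in query_tokens:
        for d in doc_tokens:
            feats[f"{q}|{d}"] += 1.0
    return feats

def score(weights: Dict[str, float], feats: Dict[str, float]) -> float:
    """Linear ranking score under the current sparse weight vector."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def train_pairwise(
    pairs: List[Tuple[List[str], List[str], List[str]]],
    epochs: int = 5,
    lr: float = 0.1,
    margin: float = 1.0,
) -> Dict[str, float]:
    """SGD on a pairwise hinge loss: the weakly labeled positive document
    should outscore the sampled negative document by at least `margin`."""
    w: Dict[str, float] = defaultdict(float)
    for _ in range(epochs):
        for query, pos_doc, neg_doc in pairs:
            f_pos = sparse_features(query, pos_doc)
            f_neg = sparse_features(query, neg_doc)
            if score(w, f_pos) - score(w, f_neg) < margin:
                # Margin violated: push the positive pair up, the negative down.
                for k, v in f_pos.items():
                    w[k] += lr * v
                for k, v in f_neg.items():
                    w[k] -= lr * v
    return w

# Toy usage: a German query with an English document marked relevant only by a
# link (positive) and an unlinked English document (negative).
pairs = [(["herz", "infarkt"], ["heart", "attack", "risk"], ["stock", "market"])]
w = train_pairwise(pairs)
print(sorted(w.items(), key=lambda kv: -kv[1])[:3])
```

The same pairwise setup carries over to the discriminative training of translation systems envisaged above, with translation hypotheses in place of retrieved documents; only the feature templates and the source of positive and negative examples change.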