NLP2CT - Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

UM-PCorpus: A Large Portuguese-Chinese Parallel Corpus

The UM-PCorpus is designed to be a multi-domain, high-quality parallel corpus for research and development purposes. This version provides one million aligned Portuguese-Chinese sentence pairs extracted from parallel and comparable documents. The corpus is categorized into five text domains that cover a range of topics and genres: Newswire, Legal, Subtitle, Technical, and General. For a detailed description of the corpus, refer to [1].

Please acknowledge the UM-PCorpus with an appropriate citation in any publication or presentation that contains research results obtained in whole or in part through its use. The following reference should be cited: [1].

Download UM-PCorpus

Reference

[1] Lidia S. Chao, Derek F. Wong, Chi Hong Ao, and Ana Luísa Leal. "UM-PCorpus: A Large Portuguese-Chinese Parallel Corpus." In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC'18), Miyazaki, Japan, 2018.