Mandatory Fields

Authors

Jiang J.;Way A.;Carson-Berndsen J.

Conference Title

EAMT 2010 - 14th Annual Conference of the European Association for Machine Translation

Title of Paper

Lattice score based data cleaning for phrase-based statistical machine translation

Year

2010

Month

December

Status

Published

Peer Reviewed

Optional Fields

Abstract

Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by the sentence alignment methods. The proposed method is carried out as follows: firstly, an initial phrase-based model is trained on the full sentence-aligned corpus; then for each of the sentence pairs in the corpus, word alignments are used to create anchor pairs and source-side lattices; thirdly, based on the translation model, target-side phrase networks are expanded on the lattices and Viterbi searching is used to find approximated decoding results; finally, BLEU score thresholds are used to filter out the low-score sentence pairs for the data cleaning purpose. Our experiments on the FBIS corpus showed improvements of BLEU score from 23.78 to 24.02 in Chinese-English. © 2010 European Association for Machine Translation.
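The final filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sentence_bleu`, `clean_corpus`, and `approx_decode` are hypothetical, the add-one smoothed sentence-level BLEU is an assumed variant (the abstract does not say which one is used), and the lattice construction and Viterbi search are abstracted behind a caller-supplied function.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing (an assumed variant)."""
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matches + 1) / (total + 1))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(log_precision / max_n)


def clean_corpus(pairs, approx_decode, threshold):
    """Keep the sentence pairs whose approximate decoding scores at or above threshold.

    approx_decode(source, target) is a hypothetical stand-in for the
    lattice expansion and Viterbi search; it returns a token list.
    """
    kept = []
    for source, target in pairs:
        hypothesis = approx_decode(source, target)
        if sentence_bleu(hypothesis, target) >= threshold:
            kept.append((source, target))
    return kept
```

In this sketch the threshold is a free parameter; the abstract reports that thresholds were used but does not give their values.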
