Mandatory Fields

Authors

Haque R.;Naskar S.;van Genabith J.;Way A.

Conference Title

PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation

Title of Paper

Experiments on domain adaptation for English-Hindi SMT

Year

2009

Month

December

Status

Published

Peer Reviewed

Optional Fields

Search Keyword

Domain adaptation; Statistical machine translation

Editors

Start Page

670

End Page

677

Location

Start Date

End Date

Abstract

Statistical Machine Translation (SMT) systems are usually trained on large amounts of bilingual text and monolingual target-language text. If a significant amount of out-of-domain data is added to the training data, translation quality can drop. On the other hand, training an SMT system on only a small amount of in-domain training material leads to narrow lexical coverage, which again results in low translation quality. In this paper, (i) we explore domain-adaptation techniques to combine large out-of-domain training data with small-scale in-domain training data for English-Hindi statistical machine translation, and (ii) we cluster large out-of-domain training data to extract sentences similar to in-domain sentences and apply adaptation techniques to combine the clustered sub-corpora with in-domain training data in a unified framework, achieving a 0.44-point absolute (4.03% relative) improvement in BLEU over the baseline. © 2009 by Rejwanul Haque, Sudip Kumar Naskar, Josef van Genabith, and Andy Way.
