Conference Publication Details
Mandatory Fields
Haithem Afli and Andy Way
LT4DH - Workshop on Language Technology Resources and Tools for Digital Humanities, Workshop - in conjunction with COLING 2016
Integrating Optical Character Recognition and Machine Translation of Historical Documents.
2016
December
Published
0
()
Optional Fields
109
116
Osaka, Japan
11-DEC-16
17-DEC-16
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in non-accessible formats for machine processing (e.g., Historical or Legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often renders MT ineffective. In this paper, we propose a new OCR to MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new system correction based on the combination of Language Modeling and Translation methods outperforms the baseline system by nearly 30% relative improvement.
http://www.aclweb.org/anthology/W16-4015
Grant Details
Science Foundation Ireland (SFI)
13/RC/2106