Peer-Reviewed Journal Details
Mandatory Fields
Shterionov D.;Superbo R.;Nagle P.;Casanellas L.;O’Dowd T.;Way A.
2018
May
Machine Translation
Human versus automatic quality evaluation of NMT and PBSMT
Published
10 ()
Optional Fields
A/B testing; BLEU; Evaluation metrics; F-measure; F-score; Human evaluation; Neural machine translation; NMT; PBSMT; Phrase-based statistical machine translation; Productivity; Quality comparison; Quality evaluation; Ranking; SMT; TER
1
19
© 2018 Springer Science+Business Media B.V., part of Springer Nature
Neural machine translation (NMT) has recently gained substantial popularity not only in academia, but also in industry. For its acceptance in industry it is important to investigate how NMT performs in comparison to the phrase-based statistical MT (PBSMT) model, which until recently was the dominant MT paradigm. In the present work, we compare the quality of the PBSMT and NMT solutions of KantanMT (a commercial platform for custom MT), which are tailored to accommodate large-scale translation production, where there is a limited amount of time to train an end-to-end system (NMT or PBSMT). To satisfy the time requirements of our production line, we restrict the NMT training time to 4 days; training a PBSMT system typically requires no longer than one day with the current training pipeline of KantanMT. To train both NMT and PBSMT engines for each language pair, we strictly use the same parallel corpora and the same pre- and post-processing steps (when applicable). Our results show that, even with time-restricted training of 4 days, NMT quality substantially surpasses that of PBSMT. Furthermore, we challenge the reliability of automatic quality evaluation metrics based on n-gram comparison (in particular F-measure, BLEU and TER) for NMT quality evaluation. We support our hypothesis with both analytical and empirical evidence, and we investigate how suitable these metrics are when comparing the two different paradigms.
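As an illustration of the n-gram-based automatic metrics the abstract refers to (BLEU and TER), the following minimal Python sketch computes both scores with the sacrebleu library. The hypothesis and reference sentences are invented for illustration and are not drawn from the paper's data.

# Minimal sketch: scoring MT output with the n-gram-based metrics BLEU and TER.
# Requires the sacrebleu package; the sentences are illustrative only.
import sacrebleu

# MT outputs to be scored (e.g., from an NMT or PBSMT engine).
hypotheses = [
    "the cat sat on the mat",
    "he read the book quickly",
]

# One reference stream aligned with the hypotheses
# (sacrebleu expects a list of reference streams).
references = [[
    "the cat is sitting on the mat",
    "he quickly read the book",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)

print(f"BLEU: {bleu.score:.2f}")  # higher is better
print(f"TER:  {ter.score:.2f}")   # lower is better (translation edit rate)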
0922-6567
10.1007/s10590-018-9220-z
Grant Details