Statistical machine translation (SMT) suffers
from various problems which are exacerbated
where training data is in short
supply. In this paper we address the data
sparsity problem in the Farsi (Persian) language
and introduce a new parallel corpus,
TEP++. Compared to previous results,
the new dataset proves more effective for
Farsi SMT engines and yields better output.
In our experiments using TEP++ as
bilingual training data and BLEU as a metric,
we achieved improvements of +11.17 points
(60%) and +7.76 points (63.92%) in the Farsi–English
and English–Farsi directions, respectively.
Furthermore, we describe an
engine (SF2FF) to translate between formal
and informal Farsi, which in terms of
syntax and terminology can be seen as
different languages. The SF2FF engine
also works as an intelligent normalizer for
Farsi texts. To demonstrate its use, SF2FF
was used to clean the IWSLT–2013 dataset
to produce normalized data, which gave
improvements in translation quality over
FBK’s Farsi engine when used as training
data.