Peer-Reviewed Journal Details
Mandatory Fields
Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth Jones.
2016
October
PRAGUE BULLETIN OF MATHEMATICAL LINGUISTICS
FaDA: Fast Document Aligner using Word Embedding
Published
Optional Fields
106
1
169
179
FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel crosslingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
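The scoring idea in the abstract can be illustrated with a minimal sketch. This is not FaDA's implementation. It assumes that the pseudo-query and the candidate documents are already tokenized, that their tokens share one embedding space, that the vector similarity is the cosine of averaged word vectors, and that the two scores are mixed by linear interpolation with a weight `alpha`. The function names, the toy `embeddings` dictionary, and `alpha` are all hypothetical.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding (None if none do)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_overlap(query_tokens, doc_tokens):
    """Fraction of distinct query terms that also occur in the document."""
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)


def combined_score(query_tokens, doc_tokens, embeddings, alpha=0.5):
    """Assumed combination: alpha * vector similarity + (1 - alpha) * term overlap."""
    qv = average_vector(query_tokens, embeddings)
    dv = average_vector(doc_tokens, embeddings)
    vec_sim = cosine(qv, dv) if qv is not None and dv is not None else 0.0
    return alpha * vec_sim + (1 - alpha) * term_overlap(query_tokens, doc_tokens)


def best_alignment(query_tokens, candidates, embeddings, alpha=0.5):
    """Return the key of the candidate document with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(query_tokens, candidates[name], embeddings, alpha),
    )


# Toy data, invented purely for illustration.
embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
candidates = {"doc_a": ["dog"], "doc_b": ["car"]}
aligned = best_alignment(["cat"], candidates, embeddings)
```

In this toy example `doc_a` is selected even though it shares no term with the query, because its word vector lies close to the query's. That is the kind of case where a vector-based score can add information beyond plain term overlap.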
Poland
https://www.degruyter.com/downloadpdf/j/pralin.2016.106.issue-1/pralin-2016-0016/pralin-2016-0016.pdf
10.1515/pralin-2016-0016
Grant Details
Science Foundation Ireland (SFI)
13/RC/2106