Abstract.
The paper describes a simple but very effective approach to extracting translation equivalents from
parallel corpora. We briefly present the multilingual parallel corpus used in our
experiments and then describe the pre-processing steps, a baseline iterative method, and
the actual algorithm. The evaluation of the two algorithms is presented in some detail
in terms of precision, recall and processing time. The baseline algorithm was used to
extract six bilingual lexicons and was evaluated on four of them. The second algorithm
was evaluated only on the Romanian-English noun lexicon. An analysis of the missed or
wrong translation equivalents revealed various factors, both intrinsic (due to the
method) and extrinsic (due to the working data: accuracy of the pre-processing, quality of
translation, relatedness of the bitext languages). We conclude by discussing the merits and the
drawbacks of our method in comparison with other work and comment on further
developments.
Keywords: alignment, bitext, bilingual dictionaries, evaluation, hapax legomena,
lemmatization, parallel corpora, tagging.