Computational Bilingual Lexicography

ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY
Volume 4, Numbers 3-4, 2001, 325 - 351

Dan TUFIS, Ana-Maria BARBU
RACAI -- Romanian Academy Center for Artificial Intelligence
Bucharest, Romania

Abstract.
The paper describes a simple but very effective approach to extracting translation equivalents from parallel corpora. We briefly present the multilingual parallel corpus used in our experiments and then describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall and processing time. The baseline algorithm was used to extract six bilingual lexicons and was evaluated on four of them. The second algorithm was evaluated only on the Romanian-English noun lexicon. An analysis of the missed or wrong translation equivalents revealed various factors, both intrinsic, due to the method, and extrinsic, due to the working data (accuracy of the pre-processing, quality of translation, bitext language relatedness). We conclude by discussing the merits and drawbacks of our method in comparison with other work and comment on further developments.

Keywords: alignment, bitext, bilingual dictionaries, evaluation, hapax-legomena, lemmatization, parallel corpora, tagging.