Romanian Journal of Information Science and Technology (ROMJIST)

An open-access publication


ROMJIST is a publication of the Romanian Academy,
Section for Information Science and Technology

Editor-in-Chief:
Radu-Emil Precup

Honorary Co-Editors-in-Chief:
Horia-Nicolai Teodorescu
Gheorghe Stefan

Secretariat (office):
Adriana Apostol
Address for correspondence: romjist@nano-link.net (after 1st of January, 2019)

Founding Editor-in-Chief
(until 10th of February, 2021):
Dan Dascalu

Editing of the printed version: Mihaela Marian (Publishing House of the Romanian Academy, Bucharest)

Technical editor
of the on-line version:
Lucian Milea (University POLITEHNICA of Bucharest)

Sponsor:
• National Institute for R & D
in Microtechnologies
(IMT Bucharest), www.imt.ro

ROMJIST Volume 29, No. 1, 2026, pp. 53-64, DOI: 10.59277/ROMJIST.2026.1.05
 

Ruxandra TAPU, Bogdan MOCANU, Ionut-Cosmin CHIVA
Multimodal Visual Speech Recognition for Under-Resourced Languages via Cross-Modal Learning and Large Language Models

ABSTRACT: This paper introduces a unified approach to multilingual visual speech recognition (VSR) that combines cross-modal phonetic modeling with large-scale language decoding to enable robust generalization across low-resource and previously unseen languages. The proposed architecture includes a Cross-Modal Transcriber that encodes synchronized audio-visual speech inputs into a language-agnostic phoneme space via a fine-grained cross-attention mechanism. To bridge perception and language understanding, two decoding pathways are explored: (1) a modular configuration that maps phonetic sequences to text using a pretrained large language model (LLM), and (2) an end-to-end formulation in which fused visual features are projected into the LLM’s embedding space via a lightweight adapter for direct transcription. Experimental evaluations on the mTEDx multilingual corpus show that the architecture surpasses state-of-the-art VSR models, achieving up to a 6% absolute improvement in word error rate (WER) across Latin-derived languages.
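For orientation, the following is a minimal, self-contained PyTorch sketch of the kind of cross-attention fusion the abstract describes: visual frames attend to synchronized audio frames and the fused representation is classified over a language-agnostic phoneme inventory. It is illustrative only; the module names, dimensions, phoneme set size, and loss choice are assumptions, not the authors' implementation.

# Illustrative sketch only (assumed design, not the paper's code): cross-modal
# fusion of audio-visual speech features followed by frame-level phoneme logits.
import torch
import torch.nn as nn


class CrossModalTranscriber(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_phonemes=100):
        super().__init__()
        # Visual frames act as queries; audio frames supply keys and values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Frame-wise classifier over a shared, language-agnostic phoneme inventory
        # (the inventory size of 100 is a placeholder assumption).
        self.phoneme_head = nn.Linear(d_model, n_phonemes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (batch, time, d_model), assumed to come from
        # modality-specific front-ends and to be temporally aligned.
        fused, _ = self.cross_attn(visual_feats, audio_feats, audio_feats)
        x = self.norm1(visual_feats + fused)
        x = self.norm2(x + self.ffn(x))
        # (batch, time, n_phonemes); could feed a CTC-style loss (an assumption).
        return self.phoneme_head(x)


# Usage with random tensors standing in for real video/audio encodings.
if __name__ == "__main__":
    model = CrossModalTranscriber()
    v = torch.randn(2, 75, 256)   # roughly 3 s of video at 25 fps
    a = torch.randn(2, 75, 256)   # audio features resampled to the video rate
    print(model(v, a).shape)      # torch.Size([2, 75, 100])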

KEYWORDS: Cross-modal attention; large language models; multilingual learning; visual speech recognition
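The abstract's second, end-to-end pathway projects fused audio-visual features into the LLM's token embedding space through a lightweight adapter. The sketch below shows one plausible form of such an adapter; the temporal downsampling factor, adapter shape, and LLM embedding dimension are assumptions rather than details taken from the paper.

# Illustrative adapter sketch (assumed design, not the authors' code): maps fused
# audio-visual features to pseudo-token embeddings for a frozen pretrained LLM.
import torch
import torch.nn as nn


class VisualToLLMAdapter(nn.Module):
    def __init__(self, d_model=256, llm_dim=4096, stride=4):
        super().__init__()
        # Temporal downsampling so the LLM receives a shorter "soft prompt" sequence.
        self.pool = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(
            nn.Linear(d_model, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, fused_feats):
        # fused_feats: (batch, time, d_model) from the cross-modal encoder.
        x = self.pool(fused_feats.transpose(1, 2)).transpose(1, 2)
        # (batch, ~time/stride, llm_dim): embeddings in the LLM's input space.
        return self.proj(x)


# In a Hugging Face-style setup (an assumption), the projected features would be
# concatenated with prompt token embeddings and passed as inputs_embeds, e.g.:
#   inputs_embeds = torch.cat([prompt_embeds, adapter(fused_feats)], dim=1)
#   outputs = llm(inputs_embeds=inputs_embeds, labels=target_ids)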

Read full text (pdf)





