ROMJIST Volume 29, No. 1, 2026, pp. 53-64, DOI: 10.59277/ROMJIST.2026.1.05
Ruxandra TAPU, Bogdan MOCANU, Ionut-Cosmin CHIVA
Multimodal Visual Speech Recognition for Under-Resource Languages via Cross-Modal Learning and Large Language Models
ABSTRACT: This paper introduces a unified approach to multilingual visual speech recognition (VSR) that combines cross-modal phonetic modeling with large-scale language decoding to enable robust generalization across low-resource and previously unseen languages. The architecture is built around a Cross-Modal Transcriber that encodes synchronized audio-visual speech inputs into a language-agnostic phoneme space via a fine-grained cross-attention mechanism. To bridge perception and language understanding, two decoding pathways are explored: (1) a modular configuration that maps phonetic sequences to text using a pretrained large language model (LLM), and (2) an end-to-end formulation in which fused visual features are projected into the LLM's embedding space via a lightweight adapter for direct transcription. Experimental evaluations on the mTEDx multilingual corpus show that the architecture surpasses state-of-the-art VSR models, achieving up to a 6% absolute WER improvement across Latin-derived languages.

KEYWORDS: Cross-modal attention; large language models; multilingual learning; visual speech recognition
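The two mechanisms named in the abstract, fine-grained cross-attention over synchronized audio-visual streams and a lightweight adapter into an LLM embedding space, can be illustrated schematically. The PyTorch sketch below is not the authors' implementation: all module names, dimensions, and design details (residual connection, layer normalization, two-layer projection) are assumptions made only to show the general shape of such an architecture.

```python
# Illustrative sketch only; hypothetical modules, not the paper's code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Visual queries attend over audio keys/values (fine-grained cross-attention)."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T_v, dim); audio: (B, T_a, dim)
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)  # residual connection around attention

class LLMAdapter(nn.Module):
    """Lightweight projection from fused speech features into an LLM embedding space."""
    def __init__(self, in_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # The output can be prepended to the LLM's token embeddings
        # as soft prompts for direct transcription (end-to-end pathway).
        return self.proj(fused)

# Usage: fuse the synchronized streams, then project for the LLM.
B, T, dim = 2, 75, 512
visual_feats = torch.randn(B, T, dim)  # e.g., lip-region encoder output
audio_feats = torch.randn(B, T, dim)   # e.g., audio encoder output
fused = CrossModalFusion(dim)(visual_feats, audio_feats)
soft_prompts = LLMAdapter(dim)(fused)  # (B, T, llm_dim)
```

In this reading, the modular pathway (1) would instead decode the fused features into a phoneme sequence and pass the phonemes to the LLM as text, while the end-to-end pathway (2) feeds the projected embeddings to the LLM directly, as sketched above.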
