Gulchin Abdullayeva, Nigar Alishzade.
A study on recurrent, attention-based, and hybrid neural architectures for sign language recognition
This study presents a systematic evaluation of three neural architectures for isolated sign language recognition: ConvLSTM, Vanilla Transformer, and a novel Hybrid RCNN+Transformer model. Through rigorous experimentation on the Azerbaijani Sign Language Dataset (AzSLD) and Word-Level American Sign Language (WLASL) dataset, we demonstrate that while the Vanilla Transformer achieves superior recognition accuracy (76.8% Top-1 on AzSLD, 88.3% on WLASL), the hybrid architecture attains competitive performance (74.2% on AzSLD, 86.9% on WLASL) with 38% fewer parameters. The ConvLSTM maintains computational efficiency advantages, requiring only 65% of the hybrid model's inference time. Our tripartite analysis reveals a performance spectrum where architectural selection depends on application-specific requirements for accuracy, computational resources, and temporal modeling precision.
Keywords: Sign Language Recognition, Recurrent Neural Networks, Convolutional Neural Networks, Transformer Neural Networks, Visual Transformers
DOI: https://doi.org/10.54381/icp.2025.1.08