Continuous Sign Language Recognition using CNN-Transformer with Adaptive Temporal Hierarchical Attention
Abstract
Continuous Sign Language Recognition (CSLR) is a critical enabler of communication for the hearing-impaired community, whose sign language relies heavily on changes in facial expression, hand movement, and body posture to convey meaning. Traditional CSLR methods primarily focus on frame-level feature extraction but often overlook dynamic temporal relationships across frames. To address this, we propose a novel hybrid architecture, CNN-Transformer with Adaptive Temporal Hierarchical Attention (CT-ATHA), which captures both local motion patterns and long-range dependencies for improved temporal modeling. The architecture consists of a ResNet-34 backbone enhanced with Motor Attention Modules (MAM) to emphasize motion-centric regions such as the hands and face. Temporal modeling follows a two-stage process: 3D CNN layers extract short-term spatio-temporal features, and Adaptive Temporal Pooling then reduces redundant frames, focusing the model's attention on the most informative temporal segments. A Transformer encoder with hierarchical attention subsequently combines local frame-level and global sentence-level context through specialized attention heads. Additionally, we introduce learnable temporal gates that detect critical motion phases, retaining high-entropy frames and pruning static ones. The decoder uses a BiLSTM with a CTC head for sequence alignment and classification. The model is trained with a multi-task objective that jointly optimizes recognition accuracy and critical-phase detection. Experiments on multiple benchmark CSLR datasets show that CT-ATHA significantly enhances motion information extraction, achieving a WER of 18.1% on RWTH, 18.8% on RWTH-T, and 23.9% on CSL-Daily despite challenges such as variable signing styles and the lack of explicit segmentation, offering a robust and efficient framework for continuous sign language recognition.
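The abstract only outlines the CT-ATHA pipeline; the sketch below is a minimal, hypothetical PyTorch rendering of that pipeline, not the authors' implementation. The module names (MotorAttention, AdaptiveTemporalPool, TemporalGate, CTATHA) and all hyper-parameters are assumptions for illustration, the hierarchical local/global attention is approximated by a standard Transformer encoder, and the 3D CNN stage is reduced to a single temporal convolution.

```python
# Illustrative sketch of the CT-ATHA pipeline described in the abstract.
# All module names and hyper-parameters are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34


class MotorAttention(nn.Module):
    """Spatial attention map emphasizing motion-centric regions (hands, face)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (B*T, C, H, W)
        attn = torch.sigmoid(self.conv(x))       # per-pixel attention weights
        return x * attn


class AdaptiveTemporalPool(nn.Module):
    """Downsamples the temporal axis to a fixed number of informative segments."""
    def __init__(self, out_len):
        super().__init__()
        self.out_len = out_len

    def forward(self, x):                        # x: (B, T, D)
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), self.out_len)
        return x.transpose(1, 2)                 # (B, T', D)


class TemporalGate(nn.Module):
    """Learnable gate scoring frames; high-entropy (motion) frames are retained."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, D)
        g = torch.sigmoid(self.score(x))         # (B, T, 1) gate values
        return x * g, g.squeeze(-1)              # gated features + phase scores


class CTATHA(nn.Module):
    def __init__(self, vocab_size, d_model=512, pooled_len=64):
        super().__init__()
        backbone = resnet34(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.mam = MotorAttention(512)
        self.spatial_pool = nn.AdaptiveAvgPool2d(1)
        # Short-term temporal modeling, reduced here to one temporal convolution.
        self.conv3d = nn.Conv3d(512, d_model, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.temporal_pool = AdaptiveTemporalPool(pooled_len)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.gate = TemporalGate(d_model)
        self.bilstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                 # (B*T, 3, H, W)
        x = self.mam(self.cnn(x))                # motion-attended spatial features
        x = self.spatial_pool(x).flatten(1)      # (B*T, 512)
        x = x.view(b, t, -1).transpose(1, 2).unsqueeze(-1).unsqueeze(-1)
        x = self.conv3d(x).squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, d_model)
        x = self.temporal_pool(x)                # (B, T', d_model)
        x = self.encoder(x)                      # simplified hierarchical attention
        x, phase_scores = self.gate(x)           # prune static frames
        x, _ = self.bilstm(x)
        log_probs = self.ctc_head(x).log_softmax(-1)  # (B, T', vocab+1)
        return log_probs, phase_scores
```

In such a setup, training would typically combine nn.CTCLoss on log_probs (permuted to time-major shape) with an auxiliary loss on phase_scores for critical-phase detection, mirroring the multi-task objective described above.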
Full Text: PDF
DOI: https://doi.org/10.31449/inf.v49i22.8403

This work is licensed under a Creative Commons Attribution 3.0 License.