Continuous Sign Language Recognition using CNN-Transformer with Adaptive Temporal Hierarchical Attention
Abstract
Continuous Sign Language Recognition (CSLR) is a critical enabler of communication for the hearing-impaired community, whose sign language relies heavily on changes in facial expression, hand movement, and body posture to convey meaning. Traditional CSLR methods primarily focus on frame-level feature extraction but often overlook dynamic temporal relationships across frames. To address this, we propose a novel hybrid architecture, CNN-Transformer with Adaptive Temporal Hierarchical Attention (CT-ATHA), which captures both local motion patterns and long-range dependencies for improved temporal modeling. The architecture consists of a ResNet-34 backbone enhanced with Motor Attention Modules (MAM) to emphasize motion-centric regions such as the hands and face. Temporal modeling follows a two-stage process: 3D CNN layers extract short-term spatio-temporal features, and Adaptive Temporal Pooling then reduces redundant frames, focusing the model's attention on the most informative temporal segments. A Transformer encoder with hierarchical attention subsequently combines local frame-level and global sentence-level context through specialized attention heads. Additionally, we introduce learnable temporal gates that detect critical motion phases, retaining high-entropy frames and pruning static ones. The decoder uses a BiLSTM with a CTC head for sequence alignment and classification. The model is trained with a multi-task objective that jointly optimizes recognition accuracy and critical-phase detection. Experiments on multiple benchmark CSLR datasets show that CT-ATHA significantly enhances motion information extraction, achieving a WER of 18.1% on RWTH, 18.8% on RWTH-T, and 23.9% on CSL-Daily despite challenges such as variable signing styles and the lack of explicit segmentation, offering a robust and efficient framework for continuous sign language recognition.
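The abstract only outlines the CT-ATHA pipeline; the sketch below is a minimal, hypothetical PyTorch rendering of that pipeline, not the authors' implementation. The module names (MotorAttention, AdaptiveTemporalPool, TemporalGate, CTATHA) and all hyper-parameters are assumptions for illustration, the hierarchical local/global attention is approximated by a standard Transformer encoder, and the 3D CNN stage is reduced to a single temporal convolution.

```python
# Illustrative sketch of the CT-ATHA pipeline described in the abstract.
# All module names and hyper-parameters are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34


class MotorAttention(nn.Module):
    """Spatial attention map emphasizing motion-centric regions (hands, face)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (B*T, C, H, W)
        attn = torch.sigmoid(self.conv(x))       # per-pixel attention weights
        return x * attn


class AdaptiveTemporalPool(nn.Module):
    """Downsamples the temporal axis to a fixed number of informative segments."""
    def __init__(self, out_len):
        super().__init__()
        self.out_len = out_len

    def forward(self, x):                        # x: (B, T, D)
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), self.out_len)
        return x.transpose(1, 2)                 # (B, T', D)


class TemporalGate(nn.Module):
    """Learnable gate scoring frames; high-entropy (motion) frames are retained."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, D)
        g = torch.sigmoid(self.score(x))         # (B, T, 1) gate values
        return x * g, g.squeeze(-1)              # gated features + phase scores


class CTATHA(nn.Module):
    def __init__(self, vocab_size, d_model=512, pooled_len=64):
        super().__init__()
        backbone = resnet34(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.mam = MotorAttention(512)
        self.spatial_pool = nn.AdaptiveAvgPool2d(1)
        # Short-term temporal modeling, reduced here to one temporal convolution.
        self.conv3d = nn.Conv3d(512, d_model, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.temporal_pool = AdaptiveTemporalPool(pooled_len)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.gate = TemporalGate(d_model)
        self.bilstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                 # (B*T, 3, H, W)
        x = self.mam(self.cnn(x))                # motion-attended spatial features
        x = self.spatial_pool(x).flatten(1)      # (B*T, 512)
        x = x.view(b, t, -1).transpose(1, 2).unsqueeze(-1).unsqueeze(-1)
        x = self.conv3d(x).squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, d_model)
        x = self.temporal_pool(x)                # (B, T', d_model)
        x = self.encoder(x)                      # simplified hierarchical attention
        x, phase_scores = self.gate(x)           # prune static frames
        x, _ = self.bilstm(x)
        log_probs = self.ctc_head(x).log_softmax(-1)  # (B, T', vocab+1)
        return log_probs, phase_scores
```

In such a setup, training would typically combine nn.CTCLoss on log_probs (permuted to time-major shape) with an auxiliary loss on phase_scores for critical-phase detection, mirroring the multi-task objective described above.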
Full Text: PDF
DOI: https://doi.org/10.31449/inf.v49i22.8403

This work is licensed under a Creative Commons Attribution 3.0 License.