BVPEC: A Cross-modal BERT-ViT Framework for Performance Emotion Recognition from Multimodal Acting Data

Yizhu Lin

Abstract


Performance emotion computing is a key technology for understanding and evaluating actors' artistic expression, with broad value in film and television analysis, drama education, and related fields. Because traditional single-modal methods struggle to fully capture the rich textual and visual emotional information in performances, this study proposes BVPEC, a performance emotion computing framework built on BERT-ViT cross-modal pre-trained models. First, the framework integrates the text of script lines with video of the actors' performances, using BERT to model the semantics and emotional tendency of the lines. Second, a Vision Transformer (ViT) extracts visual features such as facial expressions and body movements, and a cross-modal adaptive fusion mechanism is designed so that the two modalities complement each other. Finally, experiments on public datasets (e.g., the LIRIS-ACCEDE emotional video dataset) and a self-built dataset of performance clips show that BVPEC significantly outperforms single-modal models and traditional fusion methods in emotion recognition accuracy (up to 89.7%), improving both the accuracy and robustness of performance emotion understanding and offering new approaches for intelligent performing arts analysis.
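
The abstract names three components (a BERT text encoder, a ViT visual encoder, and a cross-modal adaptive fusion mechanism) without giving implementation details. The sketch below is a hypothetical illustration of such an architecture in PyTorch with Hugging Face Transformers; the checkpoint names, hidden dimension, number of emotion classes, and the gated fusion are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a BERT + ViT cross-modal emotion classifier.
# Checkpoints, dimensions, and the gated fusion are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel


class CrossModalEmotionClassifier(nn.Module):
    def __init__(self, num_emotions: int = 7, hidden_dim: int = 768):
        super().__init__()
        # Text encoder for script lines (semantics and emotional tendency).
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Visual encoder for performance frames (expressions, body movement).
        self.vision_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224-in21k"
        )
        # Adaptive fusion: a learned gate weighs the two modalities per sample.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, input_ids, attention_mask, pixel_values):
        # Pooled [CLS]-style embeddings from each encoder, shape (batch, hidden_dim).
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        vis_feat = self.vision_encoder(pixel_values=pixel_values).pooler_output

        # Gated fusion: g in (0, 1) decides how much each modality contributes.
        g = self.gate(torch.cat([text_feat, vis_feat], dim=-1))
        fused = g * text_feat + (1.0 - g) * vis_feat
        return self.classifier(fused)  # emotion logits
```

In use, the script line would be tokenized with the matching BERT tokenizer and the video frame preprocessed with a ViT image processor before being passed to `forward`; the gating design is one plausible reading of "adaptive fusion", with attention-based fusion being an equally common alternative.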




DOI: https://doi.org/10.31449/inf.v49i21.10434

This work is licensed under a Creative Commons Attribution 3.0 License.