Dawn of the transformer era in speech emotion recognition: closing the valence gap

J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Eyben, B. W. Schuller, F. Burkhardt

Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised for speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT.
March 16, 2022, CC BY-NC-SA 4.0

A scientific publication by audEERING GmbH.