Publications

We are pioneers in the field of Audio AI research. audEERING's technology is used in many research projects. We publish the results of our research in numerous articles, papers and other publications. Take a look at some of our scientific citations as well.

2023

Masking speech contents by random splicing: Is emotional expression preserved?

Felix Burkhardt, Anna Derington, Matthias Kahlau, Klaus Scherer, Florian Eyben, Björn Schuller

We discuss the influence of random splicing on the perception of emotional expression in speech signals. Random splicing is the randomized reconstruction of short audio snippets with the aim of obfuscating the speech content. A part of the German parliament recordings has been random-spliced and both versions – the original and the scrambled one – manually labeled with respect to the arousal, valence and dominance dimensions. Additionally, we ran a state-of-the-art transformer-based pretrained emotion model on the data. We find sufficiently high correlation between both sample versions, for both the annotations and the predictions of emotional dimensions, to be confident that machine learners can be trained with random-spliced data.

DOI: 10.1109/ICASSP49357.2023.10097094
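The splicing step itself is straightforward to reproduce. Below is a minimal sketch, assuming NumPy and soundfile are available and using an illustrative snippet length of 500 ms; the paper's exact splicing procedure may differ in details such as snippet boundaries.

```python
import random

import numpy as np
import soundfile as sf


def random_splice(wav_path, out_path, snippet_ms=500, seed=None):
    """Cut a recording into short snippets and rejoin them in random order,
    obfuscating the spoken content while keeping the acoustic material."""
    audio, sr = sf.read(wav_path)
    snippet_len = int(sr * snippet_ms / 1000)
    # split into consecutive snippets (the last one may be shorter)
    snippets = [audio[i:i + snippet_len] for i in range(0, len(audio), snippet_len)]
    random.Random(seed).shuffle(snippets)
    sf.write(out_path, np.concatenate(snippets), sr)


random_splice("speech.wav", "speech_spliced.wav", seed=42)
```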

2023

Multimodal Recognition of Valence, Arousal and Dominance via Late-Fusion of Text, Audio and Facial Expressions

Annette Rios, Uwe Reichel, Chirag Bhuvaneshwara, Panagiotis Filntisis, Petros Maragos, Felix Burkhardt, Florian Eyben, Björn Schuller, Fabrizio Nunnari and Sarah Ebling

We present an approach for the prediction of valence, arousal, and dominance of people communicating via text/audio/video streams for a translation from and to sign languages. 

The approach consists of the fusion of the outputs of three CNN-based models dedicated to the analysis of text, audio, and facial expressions. Our experiments show that any combination of two or three modalities increases prediction performance for valence and arousal.

DOI: 10.14428/esann/2023.ES2023-128
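Late fusion of this kind amounts to combining the per-modality valence/arousal/dominance estimates after each model has run. The sketch below is a generic illustration with assumed model outputs and a simple weighted average, not the authors' implementation.

```python
import numpy as np


def late_fusion(predictions, weights=None):
    """Fuse per-modality VAD predictions by weighted averaging.

    predictions: dict mapping modality name -> array [valence, arousal, dominance]
    weights:     optional dict of per-modality weights (default: equal weights)
    """
    modalities = list(predictions)
    w = np.array([1.0 if weights is None else weights[m] for m in modalities])
    w = w / w.sum()
    stacked = np.stack([predictions[m] for m in modalities])  # (n_modalities, 3)
    return (w[:, None] * stacked).sum(axis=0)


# hypothetical outputs of the text, audio and face models for one utterance
fused = late_fusion({
    "text":  np.array([0.3, 0.6, 0.5]),
    "audio": np.array([0.4, 0.7, 0.4]),
    "face":  np.array([0.2, 0.5, 0.6]),
})
print(fused)  # element-wise weighted mean over the three modalities
```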

2022

Ethical Awareness in Paralinguistics: A Taxonomy of Applications

A. Batliner, M. Neumann, F. Burkhardt, A. Baird, S. Meyer, T. Vu, B. Schuller

November 2022, International Journal of Human-Computer Interaction: Since the end of the last century, the automatic processing of paralinguistics has been widely investigated and put into practice in many applications on wearables, smartphones, and computers. In this contribution, we address ethical awareness for paralinguistic applications by establishing taxonomies for data representations, system designs, a typology of applications, and users/test sets and subject areas.
DOI: 10.1080/10447318.2022.2140385

2022

Voice Analysis for Neurological Disorder Recognition–A Systematic Review and Perspective on Emerging Trends

P. Hecker, N. Steckhan, F. Eyben, B. W. Schuller, B. Arnrich

July 2022, Frontiers in Digital Health 4:842301. Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive, large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects for effectively obtaining relevant information from participants. Therefore, we performed a systematic review to provide a high-level overview of practices across various neurological disorders and to highlight emerging trends.
DOI: 10.3389/fdgth.2022.842301, License: CC BY

2022

A Comparative Cross Language View On Acted Databases Portraying Basic Emotions Utilising Machine Learning

F. Burkhardt, A. Hacker, U. Reichel, H. Wierstorf, F. Eyben, B.W. Schuller

Proceedings of the Thirteenth Language Resources and Evaluation Conference. For several decades, emotional databases have been recorded by various laboratories. Many of them contain acted portrayals of Darwin's famous "big four" basic emotions. In this paper, we investigate to what extent a selection of them are comparable, by two approaches: on the one hand, modeling similarity as performance in cross-database machine learning experiments, and on the other, analyzing a manually picked set of four acoustic features that represent different phonetic areas. It is interesting to see to what extent specific databases (we added a synthetic one) perform well as training sets for others, while some do not. Generally speaking, we found indications for both similarity and specificity across languages.
Anthology ID: 2022.lrec-1.204, June 2022, Pages: 1917–1924

2022

Nkululeko: A Tool For Rapid Speaker Characteristics Detection

F. Burkhardt, J. Wagner, H. Wierstorf, F. Eyben, B. Schuller

Proceedings of the Thirteenth Language Resources and Evaluation Conference. We present advancements to a software tool called Nkululeko, which lets users perform (semi-)supervised machine learning experiments in the speaker characteristics domain. It is based on audformat, a format for speech database metadata description. Thanks to an interface based on configurable templates, it supports best practice and very fast setup of experiments without the need to be proficient in the underlying language, Python. The paper explains the handling of Nkululeko and presents two typical experiments: comparing expert acoustic features with artificial neural network embeddings for emotion classification and speaker age regression.
Anthology ID: 2022.lrec-1.205, Pages: 1925–1932
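The template-based setup mentioned in the abstract can be pictured as a short experiment configuration that names the data, the features and the model. The section and key names below are illustrative assumptions modelled on the paper's description, not guaranteed to match the current Nkululeko release; consult the project documentation for the authoritative format.

```ini
; hypothetical Nkululeko experiment template (section and key names are assumptions)
[EXP]
root = ./experiments/
name = emodb_emotion
[DATA]
databases = ['emodb']
emodb = ./data/emodb/
target = emotion
[FEATS]
type = ['os']        ; expert acoustic features (e.g. openSMILE)
[MODEL]
type = xgb           ; classifier trained on the extracted features
```

A second template that swaps the [FEATS] type for neural-network embeddings would correspond to the feature-comparison experiment described above.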

2022

SyntAct: A Synthesized Database of Basic Emotions

F. Burkhardt, F. Eyben, B.W. Schuller

Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL). Speech emotion recognition has been a focus of research for several decades and has many applications. One problem is sparse data for supervised learning. One way to tackle this problem is the synthesis of data with emotion-simulating speech synthesis approaches.
LREC 2022, pages 1–9, Marseille, 24 June 2022. © European Language Resources Association (ELRA), licensed under CC BY-NC 4.0

2022

Perceived emotions in infant-directed narrative across time and speech acts

K. Mády, B. Gyuris, H.M. Gärtner, A. Kohári, A. Szalontai, U. Reichel

Speech Prosody 2022, 23–26 May 2022, Lisbon, Portugal. One important function of infant-directed speech (IDS) is to express positive emotions towards the baby. This has been shown based on prosodic parameters before, but parameters such as f0 and energy encode emotion expression only indirectly. In this study, we aim to access emotion expression (arousal and valence) in IDS directly, through labellers' perception. Recordings were made in the first 18 months of the baby's life: at the ages of 0, 4, 8 and 18 months.
May 2022, DOI: 10.21437/SpeechProsody.2022-120, Conference: Speech Prosody 2022

2022

Probing Speech Emotion Recognition Transformers for Linguistic Knowledge

A. Triantafyllopoulos, J. Wagner, H. Wierstorf, M. Schmitt, U. Reichel, F. Eyben, F. Burkhardt, B. W. Schuller

April 2022, License: CC BY 4.0. Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance.

2022

Dawn of the transformer era in speech emotion recognition: closing the valence gap

J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Eyben, B. W. Schuller, F. Burkhardt

Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT.
March 16, 2022, License: CC BY-NC-SA 4.0
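To make the general setup concrete, the sketch below shows the common pattern of dimensional SER on top of a pre-trained wav2vec 2.0 encoder: pooled hidden states feed a small regression head that outputs valence, arousal and dominance. The checkpoint name, pooling strategy and head are illustrative assumptions, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


class DimensionalSERHead(nn.Module):
    """Pools wav2vec 2.0 hidden states and regresses valence/arousal/dominance."""

    def __init__(self, encoder_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 3)  # V, A, D

    def forward(self, input_values):
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)                            # mean pooling over time
        return self.head(pooled)


extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = DimensionalSERHead()
waveform = torch.zeros(16000)  # 1 s of silence at 16 kHz as a stand-in signal
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    vad = model(inputs.input_values)
print(vad)  # untrained head: arbitrary values until fine-tuned on SER data
```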

2021

Age Classification: Comparison of Human vs Machine Performance in Prompted and Spontaneous Speech

F. Burkhardt, Markus Brückl and Björn Schuller

Proc. ESSV, 2021

2020

Acoustic Correlates of Likable Speakers in the NSC Database

Benjamin Weiss, Jürgen Trouvain and F. Burkhardt

In: Voice Attractiveness – Studies on Sexy, Likable, and Charismatic Speakers, DOI: 10.1007/978-981-15-6627-1_13, 2020

2019

How should Pepper Sound – Preliminary Investigations on Robot Vocalizations

F. Burkhardt, Milenko Saponja, Julian Sessner and Benjamin Weiss

Proc. ESSV, 2019

2018

Speech Synthesizing Simultaneous Emotion-Related States

F. Burkhardt and Benjamin Weiss

Proc. Specom, 2018

2018

The Perception and Analysis of the Likeability and Human Likeness of Synthesized Speech

Alice Baird, Emilia Parada-Cabaleiro, Simone Hantke, Felix Burkhardt, Nicholas Cummins and Björn Schuller

Proc. Interspeech, 2018

2019

Robust Speech Emotion Recognition Under Different Encoding Conditions

Oates, C., Triantafyllopoulos, A., Steiner, I., & Schuller, B. W.

Proc. Interspeech 2019, pp. 3935–3939

2019

Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement

Triantafyllopoulos, A., Keren, G., Wagner, J., Steiner, I., & Schuller, B. W.

Proc. Interspeech 2019, pp. 1691–1695

2020

Towards Speech Robustness for Acoustic Scene Classification

Liu, S., Triantafyllopoulos, A., Ren, Z., & Schuller, B. W.

Proc. Interspeech 2020, pp. 3087–3091

2020

Spoken Language Identification by Means of Acoustic Mid-level Descriptors

Reichel, U. D., Triantafyllopoulos, A., Oates, C., Huber, S., & Schuller, B.

Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2020, pp. 125–132

2019

Vergleich verschiedener Machine-Learning Ansätze zur kontinuierlichen Schätzung von perzeptivem Sprechtempo (Comparison of different machine-learning approaches for the continuous estimation of perceived speech tempo)

Weiss, B., Michael, T., Reichel, U., Pauly, O.

In: Birkholz, P., Stone, S. (Eds.): Elektronische Sprachverarbeitung. Studientexte zur Sprachkommunikation 93, pp. 164–169, TUDpress, Dresden

2019

Filled pause detection by prosodic discontinuity features

Reichel, U.D., Weiss, B., Michael, T.

In: Birkholz, P., Stone, S. (Eds.): Elektronische Sprachverarbeitung. Studientexte zur Sprachkommunikation 93, pp. 272–279, TUDpress, Dresden

2018

audEERING’s approach to the One-Minute-Gradual Emotion Challenge

A. Triantafyllopoulos, H. Sagha, F. Eyben, B. Schuller

arXiv preprint arXiv:1805.01222

2017

Detecting Vocal Irony

J. Deng, B. Schuller

In: Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Vol. 10713, p. 11, Springer

2018

Emotion-awareness for intelligent vehicle assistants: a research agenda

H. J. Vögel, C. Süß, T. Hubregtsen, V. Ghaderi, R. Chadowitz, E. André, … & B. Huet

In: Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 11–15, ACM

2018

Robust Laughter Detection for Wearable Wellbeing Sensing

G. Hagerer, N. Cummins, F. Eyben, B. Schuller

In: Proceedings of the 2018 International Conference on Digital Health, pp. 156–157, ACM