Publications
We are pioneers in the field of Audio AI research. audEERING's technology is used in many research projects, and we share the results of our research in numerous articles, essays, papers and other publications. Take a look at some of our scientific citations as well.
Masking speech contents by random splicing: Is emotional expression preserved?
We discuss the influence of random splicing on the perception of emotional expression in speech signals. Random splicing is the randomized reconstruction of short audio snippets with the aim of obfuscating the speech contents. A part of the German parliament recordings has been randomly spliced, and both versions – the original and the scrambled one – were manually labeled with respect to the arousal, valence and dominance dimensions. Additionally, we ran a state-of-the-art transformer-based pretrained emotion model on the data. We find that annotations and predictions of the emotional dimensions correlate sufficiently highly between the two versions to be confident that machine learners can be trained with randomly spliced data.
DOI: 10.1109/ICASSP49357.2023.10097094
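To give a rough idea of the obfuscation technique, here is a minimal sketch of random splicing in Python, assuming numpy; the snippet duration, boundary handling and function name are our illustrative choices, not the exact procedure from the paper.

    import random
    import numpy as np

    def random_splice(signal: np.ndarray, sampling_rate: int,
                      snippet_duration: float = 0.5) -> np.ndarray:
        """Obfuscate speech content by shuffling short audio snippets.

        Illustrative only: the paper's snippet length and
        reconstruction details may differ.
        """
        snippet_length = int(sampling_rate * snippet_duration)
        num_snippets = max(1, len(signal) // snippet_length)
        # Cut the signal into consecutive snippets of roughly equal length.
        snippets = np.array_split(signal, num_snippets)
        # Reassemble them in random order to scramble the spoken content.
        random.shuffle(snippets)
        return np.concatenate(snippets)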
Multimodal Recognition of Valence, Arousal and Dominance via Late-Fusion of Text, Audio and Facial Expressions
We present an approach for predicting the valence, arousal, and dominance of people communicating via text/audio/video streams, intended for translation from and to sign languages.
The approach fuses the outputs of three CNN-based models dedicated to the analysis of text, audio, and facial expressions. Our experiments show that any combination of two or three modalities increases prediction performance for valence and arousal.
DOI: 10.14428/esann/2023.ES2023-128
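To illustrate the late-fusion idea, the following Python sketch combines per-modality predictions of valence, arousal, and dominance by a weighted average; the weighting scheme and function name are our assumptions, as the paper's exact fusion mechanism is not detailed here.

    import numpy as np

    def late_fusion(text_pred: np.ndarray,
                    audio_pred: np.ndarray,
                    video_pred: np.ndarray,
                    weights=(1.0, 1.0, 1.0)) -> np.ndarray:
        """Fuse per-modality (valence, arousal, dominance) predictions.

        Hypothetical weighted average; the paper's actual fusion may differ.
        """
        preds = np.stack([text_pred, audio_pred, video_pred])  # shape (3, 3)
        w = np.asarray(weights, dtype=float)[:, None]
        return (preds * w).sum(axis=0) / w.sum()

    # Example: each model outputs (valence, arousal, dominance) scores.
    fused = late_fusion(np.array([0.7, 0.4, 0.5]),
                        np.array([0.6, 0.5, 0.4]),
                        np.array([0.8, 0.3, 0.6]))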
Ethical Awareness in Paralinguistics: A Taxonomy of Applications
November 2022, International Journal of Human-Computer Interaction. Since the end of the last century, the automatic processing of paralinguistics has been investigated widely and put into practice in many applications, on wearables, smartphones, and computers. In this contribution, we address ethical awareness for paralinguistic applications by establishing taxonomies for data representations, system designs, and a typology of applications, as well as for users/test sets and subject areas.
DOI: 10.1080/10447318.2022.2140385
Voice Analysis for Neurological Disorder Recognition – A Systematic Review and Perspective on Emerging Trends
July 2022, Frontiers in Digital Health 4:842301. Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects for effectively obtaining relevant information from participants. Therefore, we performed a systematic review to provide a high-level overview of practices across various neurological disorders and to highlight emerging trends.
DOI: 10.3389/fdgth.2022.842301, License: CC BY
A Comparative Cross Language View On Acted Databases Portraying Basic Emotions Utilising Machine Learning
Proceedings of the Thirteenth Language Resources and Evaluation Conference. For several decades, emotional databases have been recorded by various laboratories. Many of them contain acted portrayals of Darwin’s famous “big four” basic emotions. In this paper, we investigate to what extent a selection of them are comparable, by two approaches: on the one hand, modeling similarity as performance in cross-database machine learning experiments, and on the other, analyzing a manually picked set of four acoustic features that represent different phonetic areas. It is interesting to see to what extent specific databases (we added a synthetic one) perform well as training sets for others, while some do not. Generally speaking, we found indications of both similarity and specificity across languages.
Anthology ID: 2022.lrec-1.204, June 2022, Pages: 1917–1924
Nkululeko: A Tool For Rapid Speaker Characteristics Detection
Proceedings of the Thirteenth Language Resources and Evaluation Conference. We present advancements with a software tool called Nkululeko, which lets users perform (semi-)supervised machine learning experiments in the speaker characteristics domain. It is based on audformat, a format for speech database metadata description. Thanks to an interface based on configurable templates, it supports best practice and very fast setup of experiments without the need to be proficient in the underlying language, Python. The paper explains the handling of Nkululeko and presents two typical experiments: comparing expert acoustic features with artificial neural net embeddings for emotion classification and speaker age regression.
Anthology ID: 2022.lrec-1.205, June 2022, Pages: 1925–1932
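To hint at what the template-based interface looks like, here is a sketch that writes and runs a minimal Nkululeko configuration from Python; the section and option names follow our reading of the tool's documentation and may differ between releases, and all paths are placeholders.

    import configparser
    import subprocess
    import sys

    # Assemble a minimal experiment configuration (names illustrative).
    config = configparser.ConfigParser()
    config["EXP"] = {"root": "./experiments/", "name": "emodb_demo"}
    config["DATA"] = {
        "databases": "['emodb']",
        "emodb": "./data/emodb/",   # placeholder path to the database
        "target": "emotion",        # label to predict
    }
    config["FEATS"] = {"type": "['os']"}  # expert acoustic features (openSMILE)
    config["MODEL"] = {"type": "xgb"}     # a simple classifier baseline

    with open("emodb_demo.ini", "w") as f:
        config.write(f)

    # Experiments are typically launched as a module with a config file.
    subprocess.run([sys.executable, "-m", "nkululeko.nkululeko",
                    "--config", "emodb_demo.ini"], check=True)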
SyntAct: A Synthesized Database of Basic Emotions
Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL). Speech emotion recognition has been a focus of research for several decades and has many applications. One problem is sparse data for supervised learning. One way to tackle it is the synthesis of data with emotion-simulating speech synthesis approaches.
@LREC2022, Marseille, 24 June 2022, Pages: 1–9. © European Language Resources Association (ELRA), licensed under CC BY-NC 4.0
Perceived emotions in infant-directed narrative across time and speech acts
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal. One important function of infant-directed speech (IDS) is to express positive emotions towards the baby. This has been shown based on prosodic parameters before, but parameters such as f0 and energy encode emotion expression only indirectly. In this study, we aim to access emotion expression (arousal and valence) in IDS directly, through labellers’ perception. Recordings were made in the first 18 months of the baby’s life: at the ages of 0, 4, 8 and 18 months.
May 2022, DOI: 10.21437/SpeechProsody.2022-120, Conference: Speech Prosody 2022
Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
April 2022, License: CC BY 4.0. Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance.
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT.
March 16, 2022, License: CC BY-NC-SA 4.0
The Perception and Analysis of the Likeability and Human Likeness of Synthesized Speech
Proc. Interspeech, 2018.
Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement
Proc. Interspeech 2019, 1691-1695.
Spoken Language Identification by Means of Acoustic Mid-level Descriptors
Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2020, 125-132.
Vergleich verschiedener Machine-Learning Ansätze zur kontinuierlichen Schätzung von perzeptivem Sprechtempo (Comparison of different machine learning approaches for the continuous estimation of perceived speech tempo)
In: Birkholz, P., Stone, S. (Eds.): Elektronische Sprachverarbeitung. Studientexte zur Sprachkommunikation 93, pp. 164-169, TUDpress, Dresden.
Filled pause detection by prosodic discontinuity features
In: Birkholz, P., Stone, S. (Eds.): Elektronische Sprachverarbeitung. Studientexte zur Sprachkommunikation 93, pp. 272-279, TUDpress, Dresden.
Emotion-awareness for intelligent vehicle assistants: a research agenda
In: Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 11-15, ACM.