
Publications
We are pioneers in the field of Audio AI research. audEERING's technology is used in many research projects. We provide information about the results of our research in numerous articles, papers, and other publications. Take a look at some of our scientific citations as well.
Mental wellbeing at sea: A prototype to collect speech data in maritime settings
The mental wellbeing of seafarers is particularly at risk due to isolation and demanding work conditions. Speech as a modality has proven to be well-suited for assessing mental health associated with mental wellbeing. In this work, we describe our deployment of a speech data collection platform in the noisy and isolated environment of an oil tanker and highlight the associated challenges and our learnings. We collected speech data consisting of 378 survey sessions from 25 seafarers over nine weeks. Our analysis shows that self-reported mental wellbeing measures were correlated with speech-derived features and we present initial modelling approaches. Furthermore, we demonstrate the effectiveness of audio-quality-based filtering and denoising approaches in this uncontrolled environment. Our findings encourage a more fine-grained monitoring of mental wellbeing in the maritime setting and enable future research to develop targeted interventions to improve seafarers’ mental health.
Are you sure? Analysing uncertainty quantification approaches for real-world speech emotion recognition
Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models are affected by particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Reliable UQ methods are thus of particular interest, as in many SER applications no prediction is better than a faulty prediction. While the effects of label ambiguity on uncertainty are well documented in the literature, we focus our work on an evaluation of UQ methods for SER under common challenges in real-world applications, such as corrupted signals and the absence of speech. We show that simple UQ methods can already give an indication of the uncertainty of a prediction and that training with additional OOD data can greatly improve the identification of such signals.
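As a simple illustration of the kind of low-cost uncertainty signal discussed above (not the specific methods evaluated in the paper), the entropy of a classifier's softmax output can serve as a first indicator; a minimal sketch in Python, assuming plain logits from any SER classifier:

```python
import numpy as np

def softmax_entropy(logits: np.ndarray) -> float:
    """Entropy of the softmax distribution as a crude uncertainty score.

    High entropy (close to log(num_classes)) suggests the model is unsure,
    e.g. for corrupted signals or non-speech input.
    """
    z = logits - logits.max()            # numerical stability
    p = np.exp(z) / np.exp(z).sum()      # softmax probabilities
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical logits of a 4-class SER model (neutral, happy, sad, angry)
print(softmax_entropy(np.array([3.2, 0.1, -0.5, -1.0])))   # low entropy: confident
print(softmax_entropy(np.array([0.3, 0.2, 0.25, 0.28])))   # high entropy: uncertain
```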
Using voice analysis as an early indicator of risk for depression in young adults
Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based prevention programs to reduce the risk for depression in young adults, we analyzed a large number of acoustic voice characteristics in vocal reports of emotions experienced by the participants on a specific day. We were able to identify a number of significant differences in acoustic cues, particularly with respect to the energy distribution in the voice spectrum, encouraging further research efforts to develop promising non-obtrusive risk indicators in the normal speaking voice. This is particularly important in the case of young adults who are less likely to exhibit standard risk factors for depression such as negative life experiences.
A workflow for HTR-postprocessing, labeling and classifying diachronic and regional variation in pre-modern Slavic texts
We describe ongoing work for developing a workflow for the applied use case of classifying diachronic and regional language variation in Pre-Modern Slavic texts. The data were obtained via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is to develop a workflow for such historical language data, covering HTR-postprocessing, annotating and classifying the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling, accessible for Humanists with limited experience in research data infrastructures, computational analysis or advanced methods of natural language processing (NLP). The workflow starts by addressing ground truth (GT) data creation for diagnosing and correcting HTR errors via string metrics and data-driven methods. On GT and on HTR data, we subsequently show classification results using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. Each step of the workflow is complemented with describing current limitations and our corresponding work in progress.
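One of the string metrics commonly used for diagnosing HTR errors against ground truth is the character error rate (CER); the paper does not reproduce its tooling here, but a minimal Levenshtein-based sketch looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, ground_truth: str) -> float:
    """Character error rate of an HTR hypothesis against its ground truth."""
    return levenshtein(hypothesis, ground_truth) / max(len(ground_truth), 1)

# Hypothetical HTR output vs. manually transcribed ground truth
print(cer("въ лѣто", "въ лето"))
```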
Wav2Small: Distilling Wav2Vec2 to 72k parameters for low-resource speech emotion recognition
Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of annotator opinions. However, the Concordance Correlation Coefficient (CCC) arose as an alternative metric for A/D/V, where a model's output is evaluated against a whole dataset's CCC rather than the L2 distances of individual audios. Recent studies have shown that Wav2Vec2.0 / WavLM architectures outputting a float value for each A/D/V dimension achieve today's state-of-the-art (SOTA) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large transformer SOTA A/D/V model as Teacher/Annotator to train five student models: four MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new SOTA on the MSP Podcast dataset with a valence CCC of 0.676. We choose MobileNetV4 / MobileNetV3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small, an architecture designed for minimal parameters and RAM consumption. Wav2Small, with a quantised .onnx of only 120KB, is a potential solution for A/D/V on hardware with low resources, having only 72K parameters vs. 3.12M parameters for MobileNet-V4-Small.
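For readers unfamiliar with the metric, the CCC referred to above is defined as 2*cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))^2); a minimal NumPy sketch (not the authors' evaluation code):

```python
import numpy as np

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance Correlation Coefficient between predictions and labels."""
    covariance = np.mean((pred - pred.mean()) * (gold - gold.mean()))
    return float(2 * covariance /
                 (pred.var() + gold.var() + (pred.mean() - gold.mean()) ** 2))

# Toy example with hypothetical valence annotations in [0, 1]
gold = np.array([0.2, 0.5, 0.7, 0.9])
pred = np.array([0.25, 0.45, 0.75, 0.8])
print(round(ccc(pred, gold), 3))
```

Unlike the L2 distance, CCC rewards agreement in both correlation and scale/offset over a whole evaluation set, which is why it is preferred for dimensional emotion labels.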
Testing correctness, fairness, and robustness of speech emotion recognition models
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. The framework also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. We evaluated an xLSTM-based and nine transformer-based acoustic foundation models against a convolutional baseline model, testing their performance on arousal, valence, dominance, and emotional category classification. The test results highlight that models with high correlation or recall might rely on shortcuts such as text sentiment, and differ in terms of fairness.
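The concrete metrics and thresholds are defined in the paper; the basic threshold-based pass/fail mechanism it describes can be pictured with a small hypothetical sketch (all names and values below are illustrative only):

```python
# Hypothetical illustration of threshold-based model tests; the actual
# framework, metrics, and thresholds are specified in the paper.
tests = {
    "correctness_ccc_arousal": (0.62, 0.50),  # (measured value, required threshold)
    "fairness_recall_gap_sex": (0.03, 0.05),  # gap metrics must stay *below* threshold
    "robustness_recall_noise": (0.48, 0.55),
}

def passes(name: str, value: float, threshold: float) -> bool:
    # For "gap"-style metrics smaller is better, otherwise bigger is better.
    return value <= threshold if "gap" in name else value >= threshold

for name, (value, threshold) in tests.items():
    print(f"{name}: {'PASS' if passes(name, value, threshold) else 'FAIL'}")
```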
Check your audio data: Nkululeko for bias detection
We present a new release of the software tool Nkululeko. New additions enable users to automatically perform sanity checks, data cleaning, and bias detection in the data based on machine learning predictions. Two open-source databases from the medical domain are investigated: the Androids depression corpus and the UASpeech dysarthria corpus. Results show that both databases have some bias, but not in a severe manner.
Wearable EEG-based cognitive load classification by personalized and generalized model using brain asymmetry
EEG measures have become prominent with the increasing popularity of non-invasive, portable EEG sensors for neuro-physiological measurement to assess cognitive load. In this paper, utilizing a four-channel wearable EEG device, the brain activity data of eleven participants were recorded while watching a relaxation video and performing three cognitive load tasks. The data were pre-processed using outlier rejection based on a movement filter, spectral filtering, common average referencing, and normalization. Four frequency-domain feature sets were extracted from 30-second windows, encompassing the power of four EEG frequency bands, the respective band ratios, and the asymmetry features of each band. A personalized and a generalized model were built for the binary classification between the relaxation and cognitive load tasks, as well as from self-reported labels. The asymmetry feature set outperformed the band-ratio feature sets, with a mean classification accuracy of 81.7% for the personalized model and 78% for the generalized model. A similar result for the models trained on the self-reported labels supports the use of asymmetry features for cognitive load classification. Extracting higher-level features from the asymmetry features may surpass this performance in future work. Moreover, the better performance of the personalized model motivates future work on updating pre-trained generalized models with personal data.
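As an illustration of the kind of frequency-domain features described above (the exact bands and definitions used in the paper may differ), band power and a simple left/right asymmetry index can be computed as follows:

```python
import numpy as np
from scipy.signal import welch

def band_power(signal: np.ndarray, fs: int, band: tuple[float, float]) -> float:
    """Power of an EEG channel within a frequency band, via Welch's method."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return float(np.trapz(psd[mask], freqs[mask]))

def asymmetry(left: np.ndarray, right: np.ndarray, fs: int,
              band: tuple[float, float]) -> float:
    """Hypothetical asymmetry index: log-ratio of right vs. left band power."""
    return float(np.log(band_power(right, fs, band)) -
                 np.log(band_power(left, fs, band)))

# Toy example: 30 s of random data at 256 Hz for two frontal channels, alpha band (8-13 Hz)
fs = 256
rng = np.random.default_rng(0)
left, right = rng.standard_normal(30 * fs), rng.standard_normal(30 * fs)
print(asymmetry(left, right, fs, (8.0, 13.0)))
```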
Towards supporting an early diagnosis of multiple sclerosis using vocal features
Multiple sclerosis (MS) is a neuroinflammatory disease that affects millions of people worldwide. Since dysarthria is prominent in people with MS (pwMS), this paper aims to identify acoustic features that differ between people with MS and healthy controls (HC). Additionally, we develop automatic classification methods to distinguish between pwMS and HC. In this work, we present a new dataset of a German-speaking cohort, which contains 39 relapsing MS patients with low disability and 16 HC. Findings suggest that certain interpretable speech features could be useful in diagnosing MS, and that machine learning methods could potentially support fast and unobtrusive screening in clinical practice. The study emphasises the importance of analysing free speech compared to read speech.
audb – sharing and versioning of audio and annotation data in Python
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user-interface to publish, maintain, and access the annotations and audio files of a dataset. To efficiently store the data on a server, audb automatically resolves dependencies between versions of a dataset and only uploads newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access. audb is a lightweight library and can be interfaced from any machine learning library. It supports the management of datasets on a single PC, within a university or company, or within a whole research community.
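A typical interaction with the library looks roughly like the following sketch (dataset name, version, and table are examples; availability depends on the configured repository):

```python
import audb

# List datasets available in the configured repositories
print(audb.available())

# Load a version of the public emodb dataset;
# only files not yet in the local cache are downloaded.
db = audb.load(
    "emodb",
    version="1.4.1",        # example version, pin whichever version you need
    sampling_rate=16000,    # audio is converted on the fly if required
    verbose=True,
)

# Annotations are exposed as pandas DataFrames via audformat tables
df = db["emotion"].get()
print(df.head())
```

Because only new or altered files of a version are stored on the server and fetched on demand, repeated loads of related versions largely reuse the local cache.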
Multistage linguistic conditioning of convolutional layers for speech emotion recognition
The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two. Methods: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional SER. We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a DNN, and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Results: Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behavior. Discussion: Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
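The exact architecture is specified in the paper; the general idea of conditioning an intermediate convolutional representation on a summary linguistic embedding can be sketched as follows (a generic PyTorch illustration, not the authors' code):

```python
import torch
import torch.nn as nn

class ConditionedConvBlock(nn.Module):
    """Conv block whose feature maps are shifted by a projected text embedding."""

    def __init__(self, in_ch: int, out_ch: int, text_dim: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        # Project the BERT sentence embedding to a per-channel bias
        self.text_proj = nn.Linear(text_dim, out_ch)

    def forward(self, spec_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        x = self.conv(spec_feats)                                    # (B, C, F, T)
        bias = self.text_proj(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return x + bias                                              # condition audio on text

# Toy forward pass: batch of 2 log-Mel spectrograms and BERT [CLS] embeddings
block = ConditionedConvBlock(in_ch=1, out_ch=16)
spec = torch.randn(2, 1, 64, 100)   # (batch, channel, mel bins, frames)
cls = torch.randn(2, 768)           # summary linguistic embedding
print(block(spec, cls).shape)       # torch.Size([2, 16, 64, 100])
```

Stacking several such conditioned blocks corresponds to the multistage idea, while merging the two streams only once corresponds to the single-stage variant contrasted in the paper.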
Domain-adapting BERT for attributing manuscript, century and region in pre-modern Slavic texts
Our study presents a stratified dataset compiled from six different Slavic bodies of text, for cross-linguistic and diachronic analyses of Slavic Pre-Modern language variants. We demonstrate unsupervised domain adaptation and supervised finetuning of BERT on these low-resource, historical Slavic variants, for the purposes of provenance attribution in terms of three downstream tasks: manuscript, century and copying region classification. The data compilation aims to capture diachronic as well as regional language variation and change: the texts were written in the course of roughly a millennium, incorporating language variants from the High Middle Ages to the Early Modern Period, and originate from a variety of geographic regions. Mechanisms of language change in relatively small portions of such data have been inspected, analyzed and typologized by Slavists manually; our contribution aims to investigate the extent to which the BERT transformer architecture and pretrained models can benefit this process. Using these datasets for domain adaptation, we could attribute temporal, geographical and manuscript origin on the level of text snippets with high F-scores. We also conducted a qualitative analysis of the models’ misclassifications.
Happy or evil laughter? Analysing a database of natural audio samples
We conducted a data collection on the basis of the Google AudioSet database by selecting a subset of the samples annotated with laughter. The selection criterion was the presence of a communicative act with a clear connotation of being either positive (laughing with) or negative (being laughed at). On the basis of this annotated data, we performed two experiments: on the one hand, we manually extracted and analyzed phonetic features; on the other hand, we conducted several machine learning experiments by systematically combining several automatically extracted acoustic feature sets with machine learning algorithms. This shows that the best performing models can achieve an unweighted average recall of .7.
Nkululeko: Machine learning experiments on speaker characteristics without programming
We would like to present Nkululeko, a template-based system that lets users perform machine learning experiments in the speaker characteristics domain. It is mainly targeted at users who are not familiar with machine learning or computer programming at all, and is meant to be used as a teaching tool or a simple entry-level tool into the field of artificial intelligence.
Going retro: Astonishingly simple yet effective rule-based prosody modelling for speech synthesis simulating emotion dimensions
We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotations. Results indicate that with a very simple method both dimensions arousal (.76 UAR) and valence (.43 UAR) can be simulated.
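The paper's concrete rules and optimised values are given there; to illustrate the mechanism, a rule of this kind can be expressed as a small function that wraps text in SSML prosody tags (the scaling factors below are made up for illustration):

```python
def emotional_ssml(text: str, arousal: float, valence: float) -> str:
    """Wrap text in an SSML <prosody> tag derived from emotion dimensions.

    arousal/valence are expected in [-1, 1]; the mapping values are
    illustrative only and not the rules optimised in the paper.
    """
    rate = f"{100 + int(arousal * 30)}%"    # higher arousal -> faster speech
    pitch = f"{int(valence * 15):+d}%"      # higher valence -> higher pitch
    volume = f"{int(arousal * 6):+d}dB"     # higher arousal -> louder
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}" volume="{volume}">'
            f"{text}</prosody></speak>")

print(emotional_ssml("How nice to see you!", arousal=0.7, valence=0.8))
```

Because the output is plain SSML, the same rule set can be fed to any commercial synthesizer that accepts SSML input.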
Masking speech contents by random splicing: Is emotional expression preserved?
We discuss the influence of random splicing on the perception of emotional expression in speech signals. Random splicing is the randomized reconstruction of short audio snippets with the aim to obfuscate the speech contents. A part of the German parliament recordings has been random spliced and both versions – the original and the scrambled ones – were manually labeled with respect to the arousal, valence and dominance dimensions. Additionally, we ran a state-of-the-art transformer-based pretrained emotion model on the data. We find a sufficiently high correlation between the annotations and predictions of emotional dimensions for both sample versions to be confident that machine learners can be trained with random-spliced data.
DOI: 10.1109/ICASSP49357.2023.10097094
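Random splicing, as described in the abstract above, can be sketched in a few lines (a minimal illustration; snippet length and implementation details in the paper may differ):

```python
import numpy as np

def random_splice(signal: np.ndarray, sr: int, snippet_dur: float = 0.5,
                  seed: int | None = None) -> np.ndarray:
    """Cut a mono signal into short snippets and reassemble them in random order.

    This obfuscates the spoken content while largely preserving
    paralinguistic properties such as emotional expression.
    """
    rng = np.random.default_rng(seed)
    snippet_len = int(snippet_dur * sr)
    snippets = [signal[i:i + snippet_len] for i in range(0, len(signal), snippet_len)]
    order = rng.permutation(len(snippets))
    return np.concatenate([snippets[i] for i in order])

# Toy example: 3 s of noise at 16 kHz, spliced into 0.5 s snippets
sr = 16000
audio = np.random.default_rng(0).standard_normal(3 * sr).astype(np.float32)
print(random_splice(audio, sr, seed=1).shape)  # (48000,)
```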
Multimodal Recognition of Valence, Arousal and Dominance via Late-Fusion of Text, Audio and Facial Expressions
We present an approach for the prediction of valence, arousal, and dominance of people communicating via text/audio/video streams, for a translation from and to sign languages. The approach consists of the fusion of the output of three CNN-based models dedicated to the analysis of text, audio, and facial expressions. Our experiments show that any combination of two or three modalities increases prediction performance for valence and arousal.
DOI: 10.14428/esann/2023.ES2023-128
Ethical Awareness in Paralinguistics: A Taxonomy of Applications
November 2022, International Journal of Human-Computer Interaction: Since the end of the last century, the automatic processing of paralinguistics has been investigated widely and put into practice in many applications, on wearables, smartphones, and computers. In this contribution, we address ethical awareness for paralinguistic applications by establishing taxonomies for data representations, system designs, and a typology of applications, as well as users/test sets and subject areas.
DOI: 10.1080/10447318.2022.2140385
Voice Analysis for Neurological Disorder Recognition – A Systematic Review and Perspective on Emerging Trends
July 2022, Frontiers in Digital Health 4:842301. Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects to effectively obtain relevant information from participants. Therefore, we performed a systematic review to provide a high-level overview of practices across various neurological disorders and highlight emerging trends.
DOI: 10.3389/fdgth.2022.842301, License: CC BY
A Comparative Cross Language View On Acted Databases Portraying Basic Emotions Utilising Machine Learning
Proceedings of the Thirteenth Language Resources and Evaluation Conference. For several decades, emotional databases have been recorded by various laboratories. Many of them contain acted portrayals of Darwin's famous "big four" basic emotions. In this paper, we investigate to what extent a selection of them are comparable, by two approaches: on the one hand, modeling similarity as performance in cross-database machine learning experiments, and on the other hand, analyzing a manually picked set of four acoustic features that represent different phonetic areas. It is interesting to see to what extent specific databases (we added a synthetic one) perform well as a training set for others, while some do not. Generally speaking, we found indications for both similarity as well as specificity across languages.
Anthology ID: 2022.lrec-1.204, June 2022, Pages: 1917–1924
Nkululeko: A Tool For Rapid Speaker Characteristics Detection
Proceedings of the Thirteenth Language Resources and Evaluation Conference. We present advancements of the software tool Nkululeko, which lets users perform (semi-)supervised machine learning experiments in the speaker characteristics domain. It is based on audformat, a format for speech database metadata description. Due to an interface based on configurable templates, it supports best practice and very fast setup of experiments without the need to be proficient in the underlying language: Python. The paper explains the handling of Nkululeko and presents two typical experiments: comparing expert acoustic features with artificial neural net embeddings for emotion classification and speaker age regression.
Anthology ID: 2022.lrec-1.205, Pages: 1925–1932
SyntAct: A Synthesized Database of Basic Emotions
Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL). Speech emotion recognition has been in the focus of research for several decades and has many applications. One problem is sparse data for supervised learning. One way to tackle this problem is the synthesis of data with emotion-simulating speech synthesis approaches.
@LREC2022, pages 1–9, Marseille, 24 June 2022 © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0
Perceived emotions in infant-directed narrative across time and speech acts
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal. One important function of infant-directed speech (IDS) is to express positive emotions towards the baby. This has been shown based on prosodic parameters before, but parameters such as f0 and energy encode emotion expression only indirectly. In this study, we aim to access emotion expression (arousal and valence) in IDS directly, through labellers' perception. Recordings were made in the first 18 months of the baby: at the age of 0, 4, 8 and 18 months.
May 2022, DOI: 10.21437/SpeechProsody.2022-120, Conference: Speech Prosody 2022
Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
April 2022, License: CC BY 4.0. Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal to improve automatic speech recognition performance.
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT.
March 16, 2022, CC BY-NC-SA 4.0