Powered by Voice AI
Make voices as important in human-machine interaction as they are in your daily interactions.
Voice AI enables machines to behave interactively, naturally, and individually.
This results in personalized settings, deep insights into users’ needs, and empathy as a KPI of digitalization.
With Voice AI, the greatest possible added value is extracted from human-machine interaction.
devAIce® is our Audio & Voice AI solution that can be used in numerous use cases:
- Market Research
- Robotics & IoT
- Automotive IoT
Software or hardware
Audio analysis for any product
devAIce® analyzes emotional expression and acoustic scenes, and detects many other features from audio. Our AI models perform solidly even with limited CPU power, as devAIce® is optimized for low resource consumption. Many models in devAIce® can run in real-time on embedded, low-power ARM devices such as the Raspberry Pi and other SoCs.
devAIce®: the core technology
Several models included
devAIce® comprises a total of 11 modules that can be combined and used depending on the application and context. Take a look at the devAIce® factsheet or scroll through the module overview for more information.
The VAD module detects the presence of voice in an audio stream. The detection is highly robust to noise and independent of the volume level of the voice, so even faint voices can be detected in the presence of louder background noise.
Detecting voice before analyzing large amounts of audio material makes the analysis process resource-saving and efficient: if VAD runs before the voice analysis itself, large amounts of non-voice data can be filtered out and excluded from the analysis.
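The gating pattern described above can be sketched in a few lines. This is an illustrative toy, not the devAIce® API: the energy threshold below is a naive stand-in for a real VAD model (which is model-based and noise-robust), and all function names are made up.

```python
# Toy sketch: run a cheap voice check first, analyze only voiced chunks.

def is_voiced(chunk, threshold=0.01):
    """Naive energy-based check; a real VAD is model-based and noise-robust."""
    energy = sum(s * s for s in chunk) / len(chunk)
    return energy > threshold

def analyze(chunk):
    """Placeholder for the expensive voice-analysis step."""
    return {"num_samples": len(chunk)}

def process(chunks):
    # Non-voice chunks are filtered out before the costly analysis step.
    return [analyze(c) for c in chunks if is_voiced(c)]

silence = [0.0] * 160
speech = [0.5, -0.4] * 80
results = process([silence, speech, silence])
print(len(results))  # only the voiced chunk reaches the analysis
```

The payoff is that the expensive model runs on a fraction of the input, which is exactly what makes batch analysis of large archives tractable.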
The Emotion module performs emotion recognition on voice. The module is designed to detect emotions in all languages. Currently, the module combines two independent emotion models with different output representations:
- A dimensional arousal-valence-dominance emotion model
- A four-class categorical emotion model: happy, angry, sad, and neutral
In devAIce® we offer two Emotion modules: Emotion & Emotion Large.
If you are working with limited computational resources or running your application on devices with little memory, the Emotion module is the better fit; if prediction accuracy is the priority, choose the Emotion Large module.
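The two output representations can be pictured as simple result containers. The class and field names below are assumptions for illustration, not the actual devAIce® result types; the value ranges are likewise assumed.

```python
from dataclasses import dataclass

# Illustrative containers for the two emotion output representations.

@dataclass
class DimensionalEmotion:
    arousal: float    # calm .. excited, assumed range [0, 1]
    valence: float    # negative .. positive
    dominance: float  # submissive .. dominant

@dataclass
class CategoricalEmotion:
    label: str        # one of: "happy", "angry", "sad", "neutral"
    confidence: float

dim = DimensionalEmotion(arousal=0.8, valence=0.2, dominance=0.6)
cat = CategoricalEmotion(label="angry", confidence=0.7)
print(dim.arousal, cat.label)
```

The dimensional model places an utterance in a continuous space, while the categorical model commits to one of the four labels; applications often consume one or the other depending on how fine-grained the downstream logic needs to be.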
The Multi-Modal Emotion module combines acoustics- and linguistics-based emotion recognition in a single module. It achieves higher accuracy than models that are limited to only one of the modalities (e.g. the model provided by the Emotion module). Acoustic models tend to perform better at estimating the arousal dimension of emotions, while linguistic models excel at predicting valence (positivity/negativity). The Multi-Modal Emotion module fuses information from both modalities to improve the prediction accuracy.
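A minimal late-fusion sketch of this idea weights each modality more heavily on the dimension it is typically better at (acoustics for arousal, linguistics for valence). The weights, function shape, and input values below are illustrative assumptions, not devAIce®'s actual fusion method.

```python
# Late fusion over two per-modality predictions, each a dict with
# 'arousal' and 'valence' in an assumed [0, 1] range.

def fuse(acoustic, linguistic, w_arousal=0.7, w_valence=0.3):
    return {
        # acoustics weighted higher for arousal ...
        "arousal": w_arousal * acoustic["arousal"]
                   + (1 - w_arousal) * linguistic["arousal"],
        # ... and linguistics weighted higher for valence.
        "valence": w_valence * acoustic["valence"]
                   + (1 - w_valence) * linguistic["valence"],
    }

fused = fuse({"arousal": 0.9, "valence": 0.5},
             {"arousal": 0.6, "valence": 0.1})
print(fused)
```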
The Acoustic Scene module distinguishes between three classes.
Further subclasses are recognized in each acoustic scene class. This model is currently under development – the specific subclasses will be named in the next update.
The AED module runs acoustic event detection for multiple acoustic event categories on an audio stream.
Currently, speech and music are supported acoustic event categories. The model allows events of different categories to overlap temporally.
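Temporally overlapping events can be pictured as labeled time intervals; two events of different categories may cover the same span. The interval representation and event data below are made up for illustration.

```python
# Events as (start_s, end_s, category) tuples; categories may co-occur.

def overlaps(a, b):
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

events = [
    (0.0, 5.0, "music"),   # background music for the whole clip
    (1.2, 3.4, "speech"),  # speech over the music
]
print(overlaps(events[0], events[1]))  # the two categories co-occur
```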
The Speaker Attributes module estimates the personal attributes of speakers from voice and speech. These attributes include:
- Perceived gender (sex), divided into the categories:
  - female adult
  - male adult
- Perceived age, in years
devAIce® provides two gender submodules called Gender and Gender (Small).
The age and gender models are trained on self-reported gender labels.
The Speaker Verification module evaluates whether the speaker in a recording is the same as a previously enrolled reference speaker.
Two modes are supported:
- Enrollment mode: a speaker model for a reference speaker is created or updated based on one or more reference recordings.
- Verification mode: a previously created speaker model is used to estimate how likely it is that the same speaker is present in a given recording.
*Speaker Verification is currently in development and available as a beta version.
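The two modes can be sketched with a common embedding-based pattern: enrollment averages reference embeddings into a speaker model, and verification scores a test embedding against it. The cosine-similarity scoring, threshold, and all values below are illustrative assumptions; devAIce®'s actual speaker model and scoring are not public.

```python
import math

def enroll(embeddings):
    """Average the reference recordings' embeddings into one speaker model."""
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

def verify(model, embedding, threshold=0.8):
    """Score a test embedding against the model via cosine similarity."""
    dot = sum(m * e for m, e in zip(model, embedding))
    norm = (math.sqrt(sum(m * m for m in model))
            * math.sqrt(sum(e * e for e in embedding)))
    score = dot / norm
    return score, score >= threshold

# Enrollment mode: build the speaker model from two reference embeddings.
model = enroll([[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]])
# Verification mode: score a new recording's embedding against the model.
score, same_speaker = verify(model, [0.9, 0.1, 0.5])
print(round(score, 3), same_speaker)
```

Note that updating an existing model (as enrollment mode allows) would simply fold further reference embeddings into the average.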
The Prosody module computes the following prosodic features:
- F0 (fundamental frequency, in Hz)
- Speaking rate (in syllables per second)
Running the Prosody module in combination with VAD:
- If the VAD module is disabled, the full audio input is analyzed as a single utterance, and one prosody result is generated.
- If the VAD module is enabled, the audio input is first segmented by the VAD before the Prosody module is run on each detected voice segment. In this case, an individual prosody result is output for each segment.
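The effect of the VAD switch on the output shape can be sketched as follows. The function names and the dummy result payload are assumptions; a real result would carry F0 (Hz) and speaking rate (syllables per second), not just a duration.

```python
# Sketch of how VAD segmentation controls the number of prosody results.

def analyze_prosody(segment):
    """Placeholder for prosody analysis of one (start_s, end_s) segment."""
    return {"duration_s": segment[1] - segment[0]}

def run_prosody(audio_duration_s, vad_segments=None):
    if vad_segments is None:
        # VAD disabled: the whole input is one utterance, one result.
        return [analyze_prosody((0.0, audio_duration_s))]
    # VAD enabled: one result per detected voice segment.
    return [analyze_prosody(s) for s in vad_segments]

print(len(run_prosody(10.0)))                            # 1 result
print(len(run_prosody(10.0, [(0.5, 2.0), (4.0, 7.5)])))  # 2 results
```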
The openSMILE Features module performs feature extraction on speech. Currently, it includes the following two feature sets based on openSMILE:
This feature set consists of a total of 6373 audio features, constructed by extracting energy, voicing, and spectral low-level descriptors and computing statistical functionals over them, such as percentiles, moments, peaks, and temporal features.
While originally designed for the task of speaker emotion recognition, it has been shown to also work well for a wide range of other audio classification tasks.
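The construction principle (low-level descriptors summarized by statistical functionals) can be sketched in a few lines. The descriptor and functional choices below are a tiny illustrative subset, not the actual 6373-feature set.

```python
import statistics

# Summarize one frame-wise low-level descriptor (here, a made-up F0
# contour in Hz) with a handful of statistical functionals. The real
# set applies many functionals to many descriptors to reach thousands
# of features.

def functionals(lld):
    return {
        "mean": statistics.mean(lld),
        "stddev": statistics.pstdev(lld),
        "p50": statistics.median(lld),
        "range": max(lld) - min(lld),
    }

f0_contour = [110.0, 115.0, 120.0, 118.0, 112.0]
feats = functionals(f0_contour)
print(feats["mean"], feats["range"])
```

Because the functionals collapse a variable-length contour into a fixed-length vector, the resulting features can feed any standard classifier regardless of clip duration.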
The GeMAPS+ feature set is a proprietary extension of the GeMAPS feature set described in Eyben et al. [2]. The set consists of a total of 276 audio features and has been designed as a minimalistic, general-purpose set for common analysis and classification tasks on voice.
Jabra's Engage AI
Powered by audEERING
Jabra’s Engage AI is a call center tool for enhanced conversations between agents and clients. The integration of audEERING’s advanced AI-powered audio analysis technology devAIce® into the call center context provides a new level of agent and customer experience. The tool fits into numerous contexts, always with the goal of improving communication.
Voice analysis in Market Research
“Improve product or communication testing with emotional feedback! Our method analyzes the emotional state of your customers during the evaluation. This gives you a comprehensive insight into the emotional user experience.”
More about Market Research at audEERING ›
Robots with Empathy
The cooperation between Hanson Robotics and audEERING also seeks to further develop Sophia Hanson’s social skills. In the future, Sophia will recognize emotions during a conversation and be able to respond empathically as a result.
In nursing and other fields affected by the shortage of skilled workers, such robots equipped with social AI can help in the future.
More about Robotics at audEERING ›
The Simulation Crew
“Emotions are an essential part of our interactions,” says Eric Jutten, CEO of Dutch company The Simulation Crew.
To make their VR trainer, Iva, emotionally capable, they turned to devAIce® XR. Powered by the XR plugin, they integrated Voice AI into their product. Read the full story.
The devAIce® SDK is available for all major desktop, mobile and embedded platforms. It also performs well on devices with low computational resources, like wearables and hearables.
- Platforms: Windows, Linux, macOS, Android, iOS
- Processor architectures: x86-64, ARMv8
devAIce® Web API: cloud-powered, native for the web
devAIce® Web API is the easiest way to integrate audio AI into your web- and cloud-based applications. On-premise deployment options are available for the highest data security requirements. The Web API is accessible:
- via the command-line (CLI) tool
- by directly sending HTTP requests
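The HTTP route can be sketched as below. The URL, header names, and authentication scheme are placeholders, not the real devAIce® Web API endpoints or parameters; consult the official documentation for those.

```python
import urllib.request

# Placeholder endpoint and credentials, NOT the real devAIce® Web API.
API_URL = "https://api.example.com/v1/analyze"

def build_request(audio_bytes, api_key):
    """Build a POST request carrying raw audio; auth scheme is assumed."""
    return urllib.request.Request(
        API_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "audio/wav",
        },
        method="POST",
    )

req = build_request(b"RIFF...", "my-api-key")
print(req.get_method(), req.get_header("Content-type"))
```

Sending the request (e.g. with `urllib.request.urlopen`) would then return the analysis results; the CLI tool wraps the same calls for scripted use.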
devAIce® XR: the Unity & Unreal plugin
devAIce® XR integrates emotions and intelligent audio analysis into virtual reality. The plugin is designed to be integrated into your Unity or Unreal project. Don’t miss the moment to include the most important part of interaction: empathy.