The Status Quo of Emotional Artificial Intelligence

Felix Burkhardt

Artificial intelligence becoming emotionally intelligent is currently discussed everywhere in the media and in industry. But what is the actual status quo in the industry? Can it be measured by voice assistants like Siri and Alexa?

Are we ripe for emotions?

In large parts of the world, the barrier between humans and machines, and with it the barrier to the pervasive Internet, is getting lower, fueled by two trends: first, ubiquitous computing via smartphones, wearables, glasses and implants, and second, home and vehicle automation via smart speakers, interlinked home components and entertainment systems.
With the vast growth in human-machine communication, the most natural form of communication given to humans comes into focus: speech. But speech is much more than just words: speech is the expression of the soul (if there is one). Most of what we express is defined not by the words we use but by the way we say them. As a forgotten Greek actor once boasted: “I can make the audience cry just by reciting the alphabet!”

So-called AI bots like Siri, Alexa, Cortana and Google Now

Ignoring this huge well of information is one of the bigger omissions (besides the lack of real intelligence) of current so-called AI bots like Siri, Alexa, Cortana or Google Now. Without it, neither urgency, disinterest, disgust nor irony will be detected and acted upon, although all of them are vital for an interaction that would earn the designation “natural”.
The “emotional channel” was long ignored. I remember a salesman from a large speech technology provider answering my question about emotional speech synthesis just 15 years ago: “We concentrate on stability and leave research to academia.” But this is changing now, fueled, of course, by the current AI hype.

The field of emotional AI

Emotional artificial intelligence is a comparatively new field, but there is tremendous movement in the area. Supported by a plethora of newly developed open-source components, modules, libraries and languages to extract acoustic features from audio and feed them into machine learning frameworks, every reasonably able programmer can now throw together a first prototype of an emotion-aware dialog machine in about two working days.
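As a rough illustration of what such a first prototype boils down to, the following sketch computes two hand-made acoustic features (RMS energy and zero-crossing rate) and feeds them into a nearest-centroid classifier. Everything in it is an assumption made for illustration: the function names, the synthetic sine tones standing in for recorded speech, and the “calm”/“aroused” labels. A real prototype would rather rely on one of the open-source feature extractors and a proper machine learning framework.

```python
import math

def tone(freq_hz, amplitude, frame_rate=8000, seconds=0.5):
    """Synthesize a sine tone as a hypothetical stand-in for a recorded utterance."""
    n = int(frame_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / frame_rate) for i in range(n)]

def acoustic_features(samples, frame_rate=8000):
    """Return (RMS energy, zero crossings per second) for a mono signal."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings * frame_rate / len(samples))

def train_centroids(labeled_features):
    """'Training' here just averages the feature vectors of each class."""
    grouped = {}
    for label, feats in labeled_features:
        grouped.setdefault(label, []).append(feats)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(feats, centroids):
    """Pick the class whose centroid is closest (features are left unscaled
    for brevity; a real system would normalize them first)."""
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))

# Toy training data: quiet, low tones are labeled "calm", loud, high tones "aroused".
training = [
    ("calm", acoustic_features(tone(110, 0.2))),
    ("calm", acoustic_features(tone(130, 0.3))),
    ("aroused", acoustic_features(tone(250, 0.7))),
    ("aroused", acoustic_features(tone(270, 0.9))),
]
centroids = train_centroids(training)
print(classify(acoustic_features(tone(230, 0.8)), centroids))
```

The point of the sketch is the pipeline shape (signal, features, classifier), not the features or the classifier themselves, both of which are far too simple for real speech.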
Besides many SMEs, all the big companies like Amazon, Microsoft, IBM or Apple already have solutions for emotion recognition from facial expression analysis on the market and surely have internal developments for recognition from speech. Many smaller companies offer services for sentiment detection from text, bio-signal analysis, and audio analysis.

But does the technology keep what the marketers promise?


The applications are manifold: emotion recognition might help with automated market research, a field where many companies already offer their services, monitoring target groups as they interact with a new product while measuring the emotional reaction objectively.
Stress or tiredness detection can help to make traffic safer, interest or boredom detection is an obvious candidate for e-learning software, and speaker classification can help adapt automated dialogs the way humans would. To mention more fields: automated security surveillance, believable characters in gaming, or training software for professional performers like salespeople and politicians come to mind.
The health-care and wellbeing domain is also a vast field: monitoring emotional expression might help me to understand others and myself and aid in therapeutic treatments.
There are even applications already on the market that are perhaps not so obvious, for example making people pay per laugh when watching comedies in a cinema.

But there are, as always in life, dangers and drawbacks:

First of all, what should an AI-driven dialog system do with the information about the user’s emotional state?
A system reacting to my emotions seems more intelligent than a dumb toaster that ignores my urgency. But can it live up to the raised expectations?
I remember the weird moment, about 12 years ago when I programmed my first emotional dialogs, when my very simple if-then-else dialog seemed intelligent, just because the erroneous detection of my own emotional state had added a layer of unpredictability.
Symbolic AI, which models the world by a rule-based expert system, is to this day only successful in very limited domains. The same goes for systems that are based on machine learning: apart from a very small part of it, the world is just too complex to be modeled by an artificial neural network or a support vector machine.
Remember: everything that can happen will happen eventually. Some events might be rare, but there is a really large number of them, so the world is chaotic by nature and eludes models!
Ontology-based machine learning techniques are a promising way to get the best of both worlds.

Another issue to be aware of is the ethical consequences of emotion detection technology: there are thousands of definitions of emotions, but most include that emotional expression is something that humans are not conscious of, cannot control directly and in many cases don’t want to have advertised. So we have to be very careful how we use these systems if we don’t want to take yet another step towards the world envisaged by George Orwell.


Emotion-aware technology is based on machine learning, which means it is fueled by data from human interactions. There are several trade-offs to be aware of: acted or elicited laboratory data is of good quality but has very limited significance for real-world emotions, which in turn are difficult to harvest given privacy issues and are, by definition, full of noise and unexpected events. Existing databases vary strongly with respect to acoustic conditions, which makes it difficult to simply use them all for one big unified model.

There’s a famous quote which illustrates the emotion definition dilemma quite well: “everyone except a psychologist knows what an emotion is” (Kleinginna & Kleinginna 1970). So what we usually do is ask humans to label the data manually according to some given categorical system, a costly procedure for the very large amounts of data needed to train machine learning systems that generalize to data from “the wild world outside my lab”.
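To make the manual labeling step more tangible: one common way to turn the judgments of several human labelers into a single training target is a majority vote, with the share of agreeing labelers kept as a rough confidence value. The sketch below is only an assumption about how this might be done; the function name, the label set and the tie handling are hypothetical.

```python
from collections import Counter

def majority_label(labels):
    """Return (winning label, share of labelers who chose it).

    The label is None if the two most frequent labels are tied, so that
    ambiguous samples can be dropped or sent back for relabeling.
    """
    counts = Counter(labels).most_common()
    top_label, top_votes = counts[0]
    share = top_votes / len(labels)
    if len(counts) > 1 and counts[1][1] == top_votes:
        return None, share
    return top_label, share

# Three hypothetical labelers judged the same utterance.
print(majority_label(["anger", "anger", "neutral"]))
```

Samples with a low agreement share are exactly the “weak and obscure” cases that are hard for humans and machines alike, so it can be useful to keep this value alongside the label.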

But these are just the first questions a prospective emotion-detection engineer would encounter; beyond that, the interplay of emotion, mood and personality confuses the matter further. How many emotional states are there at any given time? One? Two? More? How do I sound, being an extrovert, having just learned I failed my exam but being freshly in love? Can I learn to detect my emotions from a Haitian dentist if I’m a German carpenter? If there is a difference between the genders, how is it reflected in their affective expressions? Then there is the question of how long an emotion lasts, how to split up the data, and how to model transitions.

On the bright side

Most of these issues are not exclusive to emotion detection but concern machine learning in general, and there are many ideas to tackle them: unsupervised or semi-supervised learning, innovative architectures inspired by evolutionary models, or subcategorizing parameters for better generalization, just to name a few.
Faced with all these challenges, it is best to start small, keep your expectations realistic and stick to a limited domain defined by your application. Learn from the data that is produced by your system, and define your emotional models according to the requirements of the use case scenario. But wait: which system? It hasn’t been built yet! A way out of this classic chicken-and-egg problem is the so-called Wizard-of-Oz scenario, in which a concealed human mimics the system behavior in order to provoke user input to the system.

Another one is to start training the system with data gathered from another application that is similar with respect to acoustic conditions and targeted emotional expression. Or start with a rule-based system running for “friendly users”. In any case, each application should incorporate a feedback loop in order to get better with use.
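A minimal sketch of a rule-based starting point with a feedback loop might look as follows. The class name, the single loudness feature, the default threshold and the refitting rule are all hypothetical and chosen only to keep the example short; the idea is merely that corrections from friendly users are logged and later used to adjust the rule.

```python
class RuleBasedArousalDetector:
    """One hand-written rule plus a log of user corrections (illustrative only)."""

    def __init__(self, loudness_threshold=0.5):
        self.loudness_threshold = loudness_threshold
        self.feedback_log = []  # (loudness, corrected label) pairs

    def predict(self, loudness):
        # The "rule": anything louder than the threshold counts as aroused.
        return "aroused" if loudness > self.loudness_threshold else "calm"

    def record_feedback(self, loudness, corrected_label):
        # Called when a user tells the system what the right label was.
        self.feedback_log.append((loudness, corrected_label))

    def refit(self):
        # Move the threshold midway between the mean loudness of both classes,
        # but only once feedback for both classes is available.
        calm = [x for x, label in self.feedback_log if label == "calm"]
        aroused = [x for x, label in self.feedback_log if label == "aroused"]
        if calm and aroused:
            mean_calm = sum(calm) / len(calm)
            mean_aroused = sum(aroused) / len(aroused)
            self.loudness_threshold = (mean_calm + mean_aroused) / 2

detector = RuleBasedArousalDetector()
detector.record_feedback(0.4, "aroused")  # the user corrected a wrong "calm"
detector.record_feedback(0.2, "calm")
detector.refit()
print(detector.predict(0.4))
```

Once enough feedback has accumulated, the same log could serve as training data for a learned model that replaces the hand-written rule.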

How good can we be?

A number of scientific benchmarks run in the research world during the last decade might give an estimate of system accuracy, starting with the 2009 Interspeech Emotion Challenge and continuing with the first Audio/Visual Emotion Challenge (AVEC 2011). Since then, seven annual AVEC challenges have taken place, and the Interspeech series revisited emotional speech recognition in 2013. Meanwhile, challenges considering media material such as film clips have appeared, namely the annual (since 2013) Emotion in the Wild Challenge (EmotiW14) and the new Multimodal Emotion Challenge (MEC 2016 and 2017). While progress is not directly comparable, as mostly different databases were used in the challenges, it can be noted that, first, the underlying databases evolved from laboratory data to more realistic data harvested “in the wild” and, second, new techniques like sophisticated artificial neural network architectures or data augmentation lead to more stable results, not to mention the increase in computing power from the newly found application of GPUs.
Furthermore, some rules of thumb can be applied: given a classification task, the accuracy obviously depends on the number of classes and can be expected to be around twice the chance level (I speak of “real world” test data, not laboratory data hand-collected by the system designer). Aspects of emotional expression that directly influence the speech production apparatus, such as the level of arousal, are much easier to detect than, for example, valence, which is more easily detected from facial expression. Of course, fusing the results from different modalities helps, and some of them may even be derived directly from the acoustic signal, such as text analysis or estimating the pulse from fluctuations in the voice.
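One simple way to fuse results from different modalities is so-called late fusion: each modality delivers its own score per emotion class, and the scores are combined by a weighted average. The following sketch assumes exactly that; the function name, the modality names and all score values are made up for illustration, and other fusion strategies exist.

```python
def late_fusion(scores_per_modality, weights=None):
    """Combine per-class scores from several modalities by weighted averaging.

    scores_per_modality maps a modality name to a dict of class scores;
    weights optionally maps modality names to their relative importance.
    """
    if weights is None:
        weights = {modality: 1.0 for modality in scores_per_modality}
    total_weight = sum(weights[modality] for modality in scores_per_modality)
    fused = {}
    for modality, scores in scores_per_modality.items():
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + weights[modality] * score / total_weight
    return fused

# Hypothetical case: the voice signals high arousal but cannot tell the valence,
# while the facial expression resolves the ambiguity.
scores = {
    "voice": {"happy": 0.5, "angry": 0.5},
    "face": {"happy": 0.75, "angry": 0.25},
}
fused = late_fusion(scores)
print(max(fused, key=fused.get), fused)
```

The weights offer a place to encode how much each modality is trusted for the task at hand, for example giving the voice more weight when arousal is the target.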
Compared to a group of human labelers, the machine classification results can be expected to be at least as good as a well-performing human, if not super-human. And, last but definitely not least: strong and clear emotional expression will be much better recognized than weak and obscure signals.
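The “twice the chance level” rule of thumb mentioned above is easy to turn into numbers. The sketch below assumes balanced classes, so that the chance level is simply one divided by the number of classes; the function names are hypothetical, and the result is a rough expectation, not a guarantee.

```python
def chance_level(n_classes):
    """Accuracy of random guessing, assuming all classes are equally frequent."""
    return 1.0 / n_classes

def rule_of_thumb_accuracy(n_classes):
    """Roughly twice the chance level, capped at 1.0 (100 percent)."""
    return min(1.0, 2.0 * chance_level(n_classes))

for n in (4, 5, 8):
    print(n, "classes:", rule_of_thumb_accuracy(n))
```

For a four-class task, for instance, random guessing yields 25 percent, so the rule suggests something in the region of 50 percent on real-world data.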


So should I use emotional awareness in my system? By all means, yes! There is still a lot to learn, and what is currently called AI does not really deserve the attribute “intelligent”, but ignoring the vast richness of emotional expression in human-machine communication does not help in any way.
Be aware of the pitfalls, avoid raising unrealistic expectations and be sure to make your system transparent to the user. We have just started on a hopefully never-ending journey, and we won’t get anywhere if we don’t take the first step.
There clearly are already many applications in specific domains that benefit greatly from affective analysis.