As mentioned in my first post, there is more about this topic that I would like to share with you. I will cover the current pitfalls in the field of emotionally intelligent AI and the scientific benchmarks that exist, and finally draw some conclusions.
Emotion-aware technology is based on machine learning, which means it is fueled by data from human interactions. There are several trade-offs to be aware of: acted or elicited laboratory data is of good quality but has very limited significance for real-world emotions. Real-world data, in turn, is difficult to harvest given privacy issues and is, by definition, full of noise and unexpected events. Existing databases also vary strongly with respect to acoustic conditions, which makes it difficult to simply combine them all into one big unified model.
There’s a famous quote that illustrates the emotion definition dilemma quite well: “Everyone except a psychologist knows what an emotion is” (Kleinginna & Kleinginna 1970). So what we usually do is ask humans to manually label the data according to some given categorical system. This is a costly procedure for the very large amounts of data needed to train machine learning systems that generalize to data from “the wild world outside my lab”.
But these are just the first questions a prospective emotion-detection engineer would encounter. The interplay of emotion, mood and personality confuses the matter further. How many emotional states are there at any given time? One? Two? More? How do I sound, being an extrovert, having just learned I failed my exam but being freshly in love? Can a system learn to detect my emotions from a Haitian dentist if I’m a German carpenter? If there is a difference between the genders, how is it reflected in their affective expressions? Then there are the questions of how long an emotion endures, how to split up the data, and how to model transitions.
On the bright side, most of these issues are not exclusive to emotion detection but concern machine learning in general, and there are many ideas for tackling them: unsupervised or semi-supervised learning, innovative architectures inspired by evolutionary models, or subcategorizing parameters for better generalization, just to name a few.
Faced with all these challenges, it is best to start small, keep your expectations realistic and stick to a limited domain defined by your application. Learn from the data that your system produces, and define your emotional models to reflect the requirements of the use case. But wait: which system? It hasn’t been built yet! One way out of this classic chicken-and-egg problem is the so-called Wizard-of-Oz scenario, in which a concealed human mimics the system's behavior in order to provoke user input to the system.
Another is to start by training the system with data gathered from another application that is similar with respect to acoustic conditions and targeted emotional expression. Or start with a rule-based system running for “friendly users”. In any case, each application should incorporate a feedback loop in order to get better with use.
How good can we be?
A number of scientific benchmarks have been run in the research world over the last decade that might give an estimate of system accuracy, starting with the 2009 Interspeech Emotion Challenge and continuing with the first Audio-Visual Emotion Challenge (AVEC 2011). Since then, seven annual AVEC challenges have taken place, and the Interspeech series revisited emotional speech recognition in 2013. Meanwhile, challenges considering media material such as film clips have appeared, namely the annual (since 2013) Emotion in the Wild Challenge (EmotiW) and the new Multimodal Emotion Challenge (MEC 2016 and 2017). While progress is not directly comparable, as the challenges mostly used different databases, two things can be noted. First, the underlying databases have evolved from laboratory data to more realistic data harvested “in the wild”. Second, new techniques like sophisticated artificial neural network architectures or data augmentation have led to more stable results, not to mention the increase in computing power from the newly found application of GPUs.
Furthermore, some rules of thumb can be applied. Given a classification task, the accuracy obviously depends on the number of classes and can be expected to be around twice the chance level (I speak of “real world” test data, not laboratory data hand-collected by the system designer). Aspects of emotional expression that directly influence the speech production apparatus, such as the level of arousal, are much easier to detect than, say, valence, which is easier to detect from facial expression. Of course, fusing results from different modalities helps, and some of these modalities can even be derived directly from the acoustic signal, for example text analysis or pulse estimation from fluctuations in the voice.
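To make the "twice the chance level" rule of thumb concrete, here is a minimal sketch in Python. It assumes balanced classes, so that chance level is simply one divided by the number of classes. The function names and the cap at 100% are my own illustrative choices, not part of any benchmark.

```python
def chance_level(num_classes: int) -> float:
    """Accuracy of uniform random guessing, assuming balanced classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return 1.0 / num_classes


def rule_of_thumb_accuracy(num_classes: int, factor: float = 2.0) -> float:
    """Rough expectation: `factor` times chance level, capped at 1.0.

    The default factor of 2 is the rule of thumb from the text;
    the cap is an assumption added so the value stays a valid accuracy.
    """
    return min(1.0, factor * chance_level(num_classes))


if __name__ == "__main__":
    for n in (3, 4, 6):
        print(f"{n} classes: chance {chance_level(n):.2f}, "
              f"rule of thumb {rule_of_thumb_accuracy(n):.2f}")
```

For a four-class task this gives a chance level of 25% and a rough expectation of 50%. With only two classes the rule hits the cap and says nothing useful, so it is best read as a rough orientation for tasks with several classes.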
Compared to a group of human labelers, the machine classification results can be expected to be at least as good as those of a well-performing human, if not superhuman. And, last but definitely not least: strong and clear emotional expression will be recognized much better than weak and obscure signals.
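One possible way to compare a machine against a group of human labelers is sketched below. This is an assumption on my part, not a standard protocol: each human is scored against the majority vote of the other labelers, and the machine is scored against the majority vote of all of them. The labeler names and the tiny label lists are made-up toy data for illustration only.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, reference):
    """Fraction of positions where the two label sequences agree."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def human_scores(annotations):
    """Score each labeler against the majority vote of the other labelers."""
    scores = {}
    for name, own in annotations.items():
        others = [lab for other, lab in annotations.items() if other != name]
        reference = [majority_label(column) for column in zip(*others)]
        scores[name] = accuracy(own, reference)
    return scores


def machine_score(machine_labels, annotations):
    """Score the machine against the majority vote of all labelers."""
    reference = [majority_label(col) for col in zip(*annotations.values())]
    return accuracy(machine_labels, reference)


# Toy data (hypothetical): four labelers, five utterances.
annotations = {
    "labeler_a": ["angry", "neutral", "happy", "sad", "neutral"],
    "labeler_b": ["angry", "neutral", "happy", "sad", "happy"],
    "labeler_c": ["angry", "happy", "happy", "sad", "neutral"],
    "labeler_d": ["angry", "neutral", "sad", "sad", "neutral"],
}
machine = ["angry", "neutral", "happy", "sad", "neutral"]

if __name__ == "__main__":
    for name, score in human_scores(annotations).items():
        print(f"{name}: {score:.2f}")
    print(f"machine: {machine_score(machine, annotations):.2f}")
```

On real data, chance-corrected agreement measures would probably be a better choice than raw accuracy, but the leave-one-out idea stays the same.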
So should I use emotional awareness in my system? By all means, yes! There is still a lot to learn, and what is currently called AI does not really deserve the label “intelligent”, but ignoring the vast richness of emotional expression in human-machine communication does not help in any way.
Be aware of the pitfalls, avoid raising unrealistic expectations and be sure to make your system transparent to the user. We have just started on a hopefully never-ending journey, and we won’t get anywhere if we don’t take the first step. There are clearly already many applications in specific domains that benefit greatly from affective analysis.