When your toaster speaks to you: how should a machine sound?

February 11, 2019

Felix Burkhardt

How should a machine sound? We are increasingly surrounded by talking machines, but are their voices adequate?

Interaction with robots is becoming more and more a part of our daily life, irrespective of their form and size or whether they have a body at all. Home automation, smart speakers, chat apps, care-taking robots: all of these use speech, the most natural communication interface mankind has invented, but most have standard voices that don’t relate to the body they possess.

Another aspect of robotic vocal expression is the speaking style. Ideally it would adapt to the communication situation, but at the very least it should suit the task the robot is designed for. If, for example, the robot is a toy meant to interact with children, a news-anchor style of voice is probably not a good choice. I recently wrote a blog post on emotional speech synthesizers that deals with this topic in more detail.

Should robots sound “natural”?

As Roger Moore points out in [1], it is not self-evident that robot voices should be as human-like as possible. Designing them to sound more artificial might even help to avoid the uncanny valley effect [2]: an artificial being that sounds almost, but not quite, like a human does not put users at ease but tends to seem eerie.
Furthermore, a human-like voice for a robot that does nothing more than bring you some food could tempt you to chat with it about your marriage problems, which will probably end in frustration.
Wilson et al. [3] analyzed robot voices from movies and games and described them in terms of acoustic parameters. Most of them featured a deliberately high degree of artificiality, which “… can be achieved by a small increase in pitch, followed by adding harmonies and introducing some echo”. Of course, the fact that people have been exposed to certain characteristics of robot (and alien) voices in the media since the 1950s has created certain expectations.
So although, to my knowledge, no speech synthesizer has ever spoken with a flat pitch contour, people will take you for a robot as soon as you start to speak in a monotonous way. The consequence: if you want your robot voice to be more convincing as a robot, give it these artificial traits, even if there is no technical reason to do so.
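The recipe quoted from [3] above (a small increase in pitch, added harmonies, some echo) can be tried out with a few lines of Python. The following is only a rough sketch under stated assumptions, not the processing used in that study: the function names, the pitch factor of 1.1, the harmony at a musical fifth (factor 1.5), and the echo delay and gain are all made-up illustrative values. Pitch is raised by simple resampling, which also shortens the signal; a real system would use a pitch shifter that preserves duration.

```python
import numpy as np


def resample(signal, factor):
    """Play a signal `factor` times faster via linear interpolation.

    This raises the pitch by `factor` and shortens the duration accordingly.
    """
    n_out = int(len(signal) / factor)
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)


def robotize(signal, sample_rate, pitch_factor=1.1, harmony_factor=1.5,
             echo_delay_s=0.08, echo_gain=0.4):
    """Hypothetical 'robot voice' effect: pitch up, add a harmony, add an echo.

    All default parameter values are illustrative assumptions.
    """
    # 1. Small increase in pitch.
    raised = resample(signal, pitch_factor)

    # 2. Add a harmony: a quieter copy at a higher pitch, mixed on top.
    harmony = resample(raised, harmony_factor)
    mixed = raised.copy()
    mixed[:len(harmony)] += 0.5 * harmony

    # 3. Add a single echo: a delayed, attenuated copy of the mix.
    delay = int(echo_delay_s * sample_rate)
    out = np.zeros(len(mixed) + delay)
    out[:len(mixed)] += mixed
    out[delay:] += echo_gain * mixed

    # Normalize so the peak amplitude is 1 and the result does not clip.
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    tone = np.sin(2 * np.pi * 200.0 * t)  # stand-in for a voice recording
    robot = robotize(tone, sample_rate)
    print(len(tone), len(robot))
```

With a real recording in place of the test tone, the three parameters can be varied to hear how quickly a voice starts to sound "robotic".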

Large bodies have low voices, or do they?

When robots speak, the voice is usually generated not by mechanical vocal organs but by digital speech synthesis, so the match between voice and appearance does not come naturally. There are, however, some basic laws of physics that link the sound of a voice to the body producing it, for example that longer vocal cords vibrate more slowly. If the match is bad, users might not even realize that the voice originates from the robot.
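The link between vocal cord length and pitch can be illustrated with the textbook ideal-string formula, f0 = sqrt(stress / density) / (2 * length). This is a strong simplification of real vocal folds, and it is an assumption brought in here for illustration, not a model taken from the cited literature. The function name and all numeric values below (fold lengths, tissue stress, tissue density) are likewise illustrative assumptions.

```python
import math


def string_model_f0(length_m, stress_pa, density_kg_m3=1040.0):
    """Fundamental frequency (Hz) of an ideal vibrating string.

    f0 = sqrt(stress / density) / (2 * length)
    The default density is an assumed value for soft tissue.
    """
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)


if __name__ == "__main__":
    stress = 20_000.0  # Pa, assumed to be equal for both cases
    for label, length in [("longer folds", 0.016), ("shorter folds", 0.010)]:
        f0 = string_model_f0(length, stress)
        print(f"{label}: {length * 1000:.0f} mm -> {f0:.0f} Hz")
```

In this model, doubling the length at the same stress halves the fundamental frequency, which is the intuition behind expecting a large body to have a low voice.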
Another interesting question: is it nicer to have one global intelligence that speaks to you through a multitude of machines, or would we prefer each machine to have its own voice and personality? And if so, which personality, and what about gender? Clifford Nass and Scott Brave wrote an interesting book about this more than a decade ago [4].

Should the robot sound happy?

The suitability of an emotional synthesizer of course depends primarily on its application: a synthesizer that gives cartoon characters a voice has to meet different demands than a system that makes the voice of a person with a speech impairment sound more natural. Use cases for emotional speech include fun applications such as emotional greetings, prostheses, chat avatars, gaming, believable characters, adapted dialog design, adapted persona design, target-group-specific advertising, believable agents and artificial humans; the applications further down this list are closely tied to the development of artificial intelligence. Because emotions and intelligence are closely intertwined, great care is needed when computer systems appear to react emotionally without having the intelligence to meet the expectations this raises about their dialog abilities.

Literature

[1] Moore, R. K.: Appropriate voices for artefacts: some key insights. In 1st Int. Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR-2017). 2017.
[2] Mori, M.: The Uncanny Valley. Energy, 7(4), pp. 33–35, 1970.
[3] Wilson, S. and Moore, R. K.: Robot, alien and cartoon voices: implications for speech-enabled systems. In 1st Int. Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR-2017). 2017.
[4] Nass, C. and Brave, S.: Wired for Speech. MIT Press, 2005.