Speaking Style, or Should Computers Sound Boring?

Felix Burkhardt

Text-to-speech synthesis has made tremendous progress in recent years, but speaking style is still a challenge.

My Bose box and its tone of voice

Recently my Bose box spoke to me in a voice that sounded so unlike anything that could be generated by such a little box that it gave me a start. Besides that, it seemed offended while saying “smartphone connected”, judging from the tone of its voice.

Emotional speech synthesis is an important part of the puzzle on the long way to human-like human-machine interaction. No one ever speaks without emotion, and speaking style at times carries more information than the actual words: imagine your car informing you about the lack of fuel in the same tone as the temperature outside.

Emotional simulation as a feature in speech synthesizers

Despite this fact, emotional simulation is not yet a self-evident feature of current speech synthesizers.
One reason for this certainly lies in the complexity of human vocal expression: current state-of-the-art synthesizers still struggle to pronounce unknown words in an understandable and natural-sounding way, although the latter demand already hints at the importance of affective expression.

Standing on giants

To generate speech one might copy the vocal apparatus (flap wings) or the voice signal (just fly). As it turns out, in most practical applications the latter gives faster results. Analogous to artificial “intelligence”, speech can be generated by a set of fixed expert rules or by extracting samples from a database.

When mankind started to work on talking machines in the 1940s [1], the concern was more with the intelligibility of the speech than with diverse vocal expression. Even the first commercial synthesizers, rule-based systems like DECtalk (made famous by Stephen Hawking), came in a variety of voices but not speaking styles, although research already showed promising results [2].

Later, in the 1990s, the first wave of commercial synthesis outside the medical domain arrived with the invention of the PSOLA algorithm [3], which allowed the melody and rhythm of voice samples to be modified. These samples were constructed by gluing together bits of speech cut at the steady state of phones, an approach known as diphone synthesis. I wouldn’t know about commercial systems, but Iida et al. from ATR [4] and Marc Schröder at the DFKI pioneered the idea of multiplying the database across different voice qualities to simulate emotional arousal.
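To give an idea of how PSOLA manipulates melody and rhythm, here is a toy NumPy sketch of time-domain pitch-synchronous overlap-add. It is a simplification rather than the algorithm of [3]: it assumes the pitch marks (one position per glottal pulse) are already known, which in practice needs an epoch detector, and it ignores unvoiced segments.

```python
import numpy as np

def td_psola(signal, marks, pitch_factor=1.0, time_factor=1.0):
    """Toy TD-PSOLA: cut two-period, Hann-windowed grains around each
    pitch mark and overlap-add them at new positions. Spacing the grains
    more densely raises the pitch (melody); stretching the time axis
    changes the rhythm."""
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(marks)
    out_len = int(len(signal) * time_factor)
    out = np.zeros(out_len)

    t_out = float(marks[1]) * time_factor          # first synthesis mark
    while t_out < out_len:
        # pick the analysis mark whose time-scaled position is closest
        i = int(np.argmin(np.abs(marks * time_factor - t_out)))
        i = min(max(i, 1), len(marks) - 2)         # need neighbours on both sides
        left, centre, right = marks[i - 1], marks[i], marks[i + 1]
        grain = signal[left:right] * np.hanning(right - left)

        start = int(t_out) - (centre - left)       # centre the grain on t_out
        lo, hi = max(start, 0), min(start + len(grain), out_len)
        if hi > lo:
            out[lo:hi] += grain[lo - start:hi - start]

        t_out += (right - centre) / pitch_factor   # new period sets the new pitch
    return out
```

Calling td_psola(signal, marks, pitch_factor=1.2), for example, would raise the melody by roughly three semitones while leaving the rhythm untouched.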

The appearance of naturalness

Then, when conversational systems really reached daily life in the new millennium, a new approach called non-uniform unit selection (which did just that: select non-uniform units from much larger databases than diphone synthesis) took a step backwards with respect to emotional flexibility, but tried to add the appearance of naturalness by inserting affective sounds into the data. Things got more interesting with statistical synthesis, which used source-filter based speech production models for voice-morphing experiments that copy the speaking-style characteristics of a source speaker to a target speaker, at the expense of a certain buzziness introduced by the source modeling.
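Under the hood, unit selection is classically framed as a search problem: pick one candidate unit from the database for every target specification so that the sum of target costs (how well a unit matches what should be said) and join costs (how smoothly neighbouring units concatenate) is minimal. The sketch below shows only this dynamic-programming search; the cost functions are placeholders supplied by the caller, not the features of any particular commercial system.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick candidates[t][j] for every target t so that the summed
    target and join costs are minimal (dynamic programming / Viterbi)."""
    n = len(targets)
    cost = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for t in range(1, n):
        row, ptr = [], []
        for cand in candidates[t]:
            # cheapest way to arrive at this candidate from the previous column
            best_i = min(range(len(candidates[t - 1])),
                         key=lambda i: cost[t - 1][i]
                         + join_cost(candidates[t - 1][i], cand))
            row.append(cost[t - 1][best_i]
                       + join_cost(candidates[t - 1][best_i], cand)
                       + target_cost(targets[t], cand))
            ptr.append(best_i)
        cost.append(row)
        back.append(ptr)

    # backtrack the cheapest path through the lattice
    j = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [candidates[t][path[t]] for t in range(n)]
```

In real systems the target cost typically compares linguistic and prosodic features, while the join cost compares spectrum and pitch at the concatenation point, which is exactly where the strong domain dependence mentioned below comes from.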

While the non-uniform unit-selection approach suffers from a strong domain dependence, new developments in the field of deep neural nets show a way out of the problem of missing generalization capabilities. For example, via so-called global style tokens, Google’s Tacotron architecture can not only copy a speaking style but even do so in a scalable way, i.e. embed a style with varying strength [5]. The engineers at Amazon seem to be working on a similar approach [7].
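The “scalable” part can be pictured as follows: a speaking style is represented as a weighted combination over a small bank of learned token embeddings, and that combination can be applied with more or less weight at synthesis time. The sketch below only mimics this arithmetic with made-up sizes and random numbers; in the real architecture the tokens and the attention weights that condition the synthesizer are learned jointly [5].

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: a bank of 10 "global style tokens", 256-dimensional each.
style_tokens = rng.normal(size=(10, 256))

def style_embedding(logits, strength=1.0):
    """Attention-style softmax over the token bank, then scale the
    resulting style embedding to apply the style more or less strongly."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return strength * (weights @ style_tokens)

# Lean mostly on one token (say, the one the model happens to use for a
# breathy style) and apply it at full versus half strength; the resulting
# vector would condition the synthesizer's decoder.
emb_full = style_embedding(5.0 * np.eye(10)[3], strength=1.0)
emb_soft = style_embedding(5.0 * np.eye(10)[3], strength=0.5)
```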

Until now, the interfaces to the expressiveness of speech synthesizers have been proprietary, though there are recommendations for a formalization of emotional style, such as W3C’s EmotionML [8]. To reach the holy grail of speech synthesis, a synthesizer capable of modeling speaker characteristics from very little data, achievements in physical modeling techniques like articulatory synthesis are promising [6], as are the new achievements in deep neural net processing.
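To make the EmotionML recommendation [8] a little more concrete, the following lines build a minimal annotation that labels an utterance as moderately happy, using one of the emotion vocabularies referenced by the standard. How, and whether, a given synthesizer interprets such markup is for now still up to the vendor.

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2009/10/emotionml"
ET.register_namespace("", NS)

# A minimal EmotionML document: one emotion, annotated with the "big6"
# category vocabulary, at moderate intensity.
root = ET.Element(f"{{{NS}}}emotionml",
                  {"version": "1.0",
                   "category-set": "http://www.w3.org/TR/emotion-voc/xml#big6"})
emotion = ET.SubElement(root, f"{{{NS}}}emotion")
ET.SubElement(emotion, f"{{{NS}}}category", {"name": "happiness", "value": "0.7"})

print(ET.tostring(root, encoding="unicode"))
```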

For those curious about how an emotional computer sounds, I have collected some samples on this web page:
http://emosamples.syntheticspeech.de/

Literature

[1] The Dudley Vocoder; see e.g. Klatt’s history of speech synthesis: https://tcscasa.org/klatts-history-of-speech-synthesis/
[2] Janet Cahn: http://alumni.media.mit.edu/~cahn/emot-speech.html
[3] Charpentier and Stella: https://ieeexplore.ieee.org/document/1168657
[4] M. Schröder’s review of emotional speech synthesis: https://www.dfki.de/lt/publication_show.php?id=1130
[5] Audio samples from Google’s Tacotron with global style tokens: https://google.github.io/tacotron/publications/global_style_tokens/
[6] P. Birkholz’s VocalTractLab: http://www.vocaltractlab.de/
[7] Amazon on Alexa’s latest features: https://blog.aboutamazon.com/devices/alexa-whats-the-latest
[8] W3C’s EmotionML recommendation: https://www.w3.org/TR/emotionml/