When we hear a person for the first time, we cannot help making assumptions, building expectations, and, yes, forming a first attitude towards the human or artificial personality conveyed by this voice. Of course, the basis for these impressions is very sparse, and hopefully we revise them once we have more valid data than a few acoustic (and visual) surface features. Nevertheless, this first impression can affect our own behavior, for example whether we react more openly or more reservedly towards the speaker, or how we interpret the other’s statements. We may even avoid further interaction altogether if we find this person dislikable.
This holds not only for human voices in face-to-face conversations, but also for telecommunication, especially given the increased bandwidth and the overall quality and naturalness now available. Nowadays, the voices of personal assistants, interactive voice response systems for self-service, and smart home devices have also reached a level of naturalness that sometimes makes it difficult to recognize them as artificial. As a consequence, social acceptance in first encounters is becoming more relevant not only in human communication, but also in Human-Computer Interaction. It is therefore time to concentrate on the social implications of voice.
Beyond the linguistic meaning, the voice and speaking style convey much more information. We may, correctly or incorrectly, infer age, social and regional background, and gender. We may even estimate attractiveness, mood, attitude, and aspects of personality. All these factors may affect whether a listener or interlocutor likes or dislikes a speaker. Background in particular is prone to positive or negative stereotypes. And without any personal experience of the speaker, a certain speaking style, e.g. that of a call-center agent, can easily be (mis-)attributed to a specific attitude or stance instead of being recognized as an individual characteristic.
Well-known mechanisms at work are the preference for people with similar backgrounds and sub-cultures, the overall preference for attractive voices, and the preference for speaking styles that signal interest and friendliness, for example through a slightly raised pitch, vivid pitch movements, a slightly increased tempo, and an audible smile. A clear voice, with its spectral correlates of less high-frequency damping and stronger harmonics, is also typically favored, as is clear pronunciation. Such mechanisms are rather similar to those for visual cues (body movements, clothing, face).
At work, we naturally take all those aspects into account when individually selecting, for example, professional voice talents as speakers. However, automatically estimating the likability of a voice is a delicate topic. For one thing, only majority votes can be predicted for specific domains. Consider, for example, a person with a strong regional accent. In contrast to gender or age, likability really lies in the eye of the beholder (and I am currently writing this blog post in the heart of Bavaria). Also, the specific situation determines whether a certain clearly pronounced or emphatically friendly style is adequate or not. Especially for synthetic voices, a style that does not fit the content and situation can dominate the first impression and thus overshadow, for instance, a clear and attractive timbre.
Also, systematic effects found in research, as well as automatic models, generalize from data. According to such generalizations, a clear voice should be preferred over speakers who exhibit, e.g., a coarse voice. However, you may already have encountered people who have a mild issue with their vocal folds. Depending on the exact manifestation, this can actually result in a voice being perceived as even more interesting, attractive, and likable.
Therefore, likability should not be considered a trait of a speaker, like age or gender, but a context-specific evaluation of (inferred) speaker states and traits by one individual person. There is, of course, agreement between listeners who rate recorded speech samples. This agreement can be modeled and provides a good first estimate for a given context. But it is not as strong as for basic emotions, for example. Taking into account acoustics alone (i.e., quality of the signal, naturalness of the speech, spectral timbre, or intonation) will certainly limit the accuracy of determining the likability of a voice. Only by understanding the specifics of the content and the social situation will virtual persons be able to become likable by design. This holds especially for social robots and synthetic persons in virtual environments, which also have a visual or even physical body that their voice and speaking style must fit. So, if a voice fits the “being” very well, it may well sound slightly distorted or out of the ordinary. Despite these challenges, there is also potential in the mutual dependency between two people in conversation, as mutual convergence in speaking style and pronunciation is regarded as positive for likability. We are pursuing approaches for utilizing this effect in adaptive dialog.
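To make the point about listener agreement a bit more concrete, here is a minimal sketch of how one might summarize ratings of recorded samples: a per-sample mean as the "majority vote" and the average pairwise Pearson correlation between listeners as a rough agreement measure. The listener names, the rating scale, and the numbers are invented for illustration, and other agreement measures could be used just as well.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (var_x * var_y) ** 0.5


def consensus_scores(ratings):
    """Mean rating per speech sample across all listeners."""
    return [mean(sample) for sample in zip(*ratings.values())]


def mean_pairwise_agreement(ratings):
    """Average Pearson correlation over all pairs of listeners."""
    pairs = combinations(ratings.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)


# Hypothetical likability ratings (1 = dislikable, 5 = likable),
# one list per listener, one entry per recorded sample.
ratings = {
    "listener_a": [5, 3, 4, 2],
    "listener_b": [4, 3, 5, 1],
    "listener_c": [2, 4, 3, 3],
}

print("consensus per sample:", consensus_scores(ratings))
print("mean pairwise agreement:", mean_pairwise_agreement(ratings))
```

A low agreement value would indicate that the consensus score says little about how any individual listener will react, which is exactly the caveat raised above.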
As a consequence, there are many ways to be regarded as likable based on your voice. With so many factors at work, no general advice can ensure likability. For example, males do not necessarily have to have a low voice in order to be perceived positively. For designers, this is, of course, a common challenge, and there are design methods and user research to find a solution for a given product or service. For automatic classifiers, instead of trying to cover the complex concept of likability, aiming at related social signals of speakers, such as friendliness and engagement, might be more fruitful for obtaining accurate predictions, at least for unknown voices. This is especially true because, as relationships develop, these vocal surface features become less relevant: they are then used only for identifying a person. We (think we) know their personality already, and we will change whether we like this person based not on voice and way of speaking, but on actions.