Current trends for voice-based applications

April 11, 2019

Caro Bauer

Ever wondered what it would be like to have your own personal assistant follow you around wherever you go? Anyone who owns a smartphone or personal computer has probably come into contact with one of the many voice assistants out there. Siri, Cortana, Google and Alexa are the best known. Their progress in understanding what you say has been staggering in the last few years. They achieved that thanks to the progress a field called Automatic Speech Recognition (ASR) has seen in just this decade.

However, while ASR performance has steadily improved and voice assistants have become consistently better at understanding what you say, they often could not understand what you mean. There’s no shortage of occasions when you had to repeat something multiple times, simplify your phrasing, limit your vocabulary, shorten your sentences, and so on. The next barrier to full-blown intelligent conversations with a machine was Natural Language Processing (NLP), in short, the automatic analysis of your sentences.

Well, I’ve got good news! Many leading AI researchers have dubbed 2018 the year of NLP, and for good reason. It was the year that saw tremendous progress in using deep neural networks to solve many of the tasks that have challenged the research community for decades. This opens the door to more intelligent voice assistant applications. But what lies beyond that might forever change the way we conceive of the world around us.

Think about the way you interact with most of your gadgets today. To get a response out of your PC or smartphone, you must type on a keyboard, click with a mouse, scroll, navigate tabs, swipe, tap with your finger and so on. Now take a moment to consider how you interact with other human beings. Chances are, most of the time you simply talk to them.

Spoken language has been our preferred form of communication since our early days as a species. In fact, it probably runs far deeper in our evolutionary history than the first humans who emerged out of Africa. Sounds are so deeply ingrained in the way we interact with our world that it’s impossible to consider a future where our communication with machines is not at least partly based on this modality.

And the moment for that has finally come. ASR was the first frontier. Obviously, if a machine cannot pick up the words we choose, there’s no sense in trying to communicate with it through speech. Now, though, our phones can transcribe almost everything we say instantaneously, even under quite diverse noise conditions.

NLP was the second barrier. What use was it for a machine to write down everything we say if almost none of it made any sense to the machine? Of course, we’re a long way from an algorithm that can comprehend human language with all its intricacies and nuances, but that is not what we need. The most important thing is basic usability.

There are many more milestones on the road to truly conversational AI, but the two most important barriers have been broken. A voice-controlled operating system will happen; it IS happening. Siri, Alexa, Google and Cortana are all changing the way we interact with our gadgets in our daily lives. Soon enough they’ll become the norm. It will be inconceivable to own a machine that cannot understand what you’re saying, much as it’s almost inconceivable today to own a machine that you cannot control with the touch of your finger, or one that is only accessible through a keyboard and a terminal.