In the machine learning world, if you want to accurately evaluate your model you need to be intimately aware of what’s in your audio data. Without this knowledge you can’t be sure of what your model is really learning. Here’s a simple example of when it’s important to know your data. Let’s say you want to train a model to tell the difference between a recording of a violin and a recording of a viola. Now let’s also say that in all your viola examples the viola starts after 1 second and in all your violin examples the violin starts after 3 seconds. Your model may then identify the starting point of the sound as the biggest difference between your violin and viola classes. Obviously, this model won’t work at all in the real world. This is why it is important to know your data.
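A quick way to catch the violin/viola problem before training is to compare onset times per class. Here is a minimal sketch using only NumPy. The function names, the -40 dB threshold and the synthetic tones are my own illustrative assumptions, not part of any particular toolkit.

```python
import numpy as np

def onset_time(signal, sample_rate, threshold_db=-40.0, frame_len=1024):
    """Time (s) of the first frame whose RMS is within threshold_db of the loudest frame."""
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return None
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0.0:
        return None  # completely silent recording
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)
    first = int(np.argmax(level_db > threshold_db))  # index of first frame above threshold
    return first * frame_len / sample_rate

def onset_summary(examples, sample_rate):
    """examples: dict mapping class label -> list of 1-D arrays. Returns (mean, std) per class."""
    summary = {}
    for label, signals in examples.items():
        onsets = [onset_time(s, sample_rate) for s in signals]
        onsets = [t for t in onsets if t is not None]
        summary[label] = (float(np.mean(onsets)), float(np.std(onsets))) if onsets else None
    return summary

# Synthetic stand-ins for real recordings (assumption: plain tones after a silent lead-in).
sr = 16000

def tone_after(delay_s, total_s=5.0, freq=440.0):
    t = np.arange(int(total_s * sr)) / sr
    x = np.sin(2 * np.pi * freq * t)
    x[: int(delay_s * sr)] = 0.0
    return x

examples = {
    "viola": [tone_after(1.0), tone_after(1.0, freq=330.0)],
    "violin": [tone_after(3.0), tone_after(3.0, freq=660.0)],
}
summary = onset_summary(examples, sr)
for label, stats in summary.items():
    print(label, stats)
```

If the per-class means sit far apart and the spreads are tiny, as they do here, the onset time is a shortcut your model could learn instead of the timbre.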
There are many facets to consider when it comes to understanding what is in our audio data. In this blog post I’d like to discuss one that has so far been overlooked in the speech and music analytics world.
Content from YouTube and other resources – different codecs and different quality
For speech analytics applications, content from YouTube, podcasts or any other online source can provide real-world training and test samples from real people in real situations with really bad microphones and recording conditions. There is a potentially endless supply of content online. However, this content is coded with different codecs and at different quality levels. If we want to harness this supply of data, we need to know what the codecs are doing to the original audio signal. We must consider how the audio signal has been degraded and what the implications are for our model. Only then can we correctly interpret the training and validation results from our machine learning algorithms.
Here’s an example scenario a data scientist may encounter. Imagine you are testing your male voice vs female voice classifier on TED Talk videos pulled from YouTube and it fails to correctly classify the speaker. Did it fail because your model hasn’t heard a voice like this one before? Or because there’s too much non-speech content in the audio, like clapping? Or because the coded audio data has exposed the fact that your model is not robust to audio coding artifacts (degradation of the original audio signal)? To find the answer you need to know your data.
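A first step towards knowing your data is simply recording which codec and bit rate each downloaded file uses. The sketch below assumes the ffprobe command-line tool (part of FFmpeg) is installed. The helper names and the example JSON are hypothetical. The parsing is kept separate from the ffprobe call so it can be tried without any audio files.

```python
import json
import subprocess

def ffprobe_audio_json(path):
    """Ask ffprobe for the first audio stream's codec details as a JSON string."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,channels,bit_rate",
        "-of", "json", path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def summarize_audio_stream(ffprobe_json):
    """Turn ffprobe's JSON into a small dict; missing fields become None."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return None
    s = streams[0]
    bit_rate = s.get("bit_rate")  # some containers do not report a stream bit rate
    return {
        "codec": s.get("codec_name"),
        "sample_rate_hz": int(s["sample_rate"]) if "sample_rate" in s else None,
        "channels": s.get("channels"),
        "bit_rate_kbps": round(int(bit_rate) / 1000) if str(bit_rate).isdigit() else None,
    }

# Hypothetical ffprobe output, used here so the example runs without a media file.
example = '{"streams": [{"codec_name": "aac", "sample_rate": "44100", "channels": 2, "bit_rate": "128000"}]}'
print(summarize_audio_stream(example))
```

On real data you would call summarize_audio_stream(ffprobe_audio_json(path)) for every file and store the result next to your labels, so a failed classification can be traced back to the codec and quality level.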
The effect of codecs on the original audio signal
Let’s talk about what codecs do to our original audio signal in the context of machine learning from audio features:
When audio is encoded with a lossy encoder like MP3, AAC or Opus, something is, well, lost. Parts of the signal are deemed to be unimportant to the human listener and discarded. Here’s an analogy from the real world: imagine someone is speaking to you through thick glass. Their voice will sound dull, not as “bright”, because the higher frequencies were reflected by the glass and didn’t make it through to your ears. In spite of the dull-sounding voice we can still understand what has been said. The same thing can happen with the low data rate encoding used to achieve higher data compression: the upper frequencies are deemed to be unimportant for the speech to be intelligible, so they are simply discarded in order to save data.
But something is also added to our coded content. Here’s another analogy from the real world: when we use our phone in a busy public place, the noise around us is added to our phone call. When we use a lossy encoder like MP3, AAC or Opus, noise is added to our signal. This noise is called quantization noise and comes from reducing the resolution of our audio data. The same effect can be readily seen in compressed images as you reduce the resolution more and more. Thankfully the codecs are smart and do their best to hide the noise from us humans, but it is not hidden from our machine learning algorithms.
Preliminary testing shows that the openSMILE feature sets are robust to the distortions of the 3 main codecs used today, namely MP3, AAC and Opus (at mid to high quality levels).
Modern parametric codecs like Extended High Efficiency AAC (xHE-AAC) have not yet achieved the ubiquity of the MP3 and AAC codecs, but they are on their way. xHE-AAC’s coding efficiency far outperforms existing codecs, especially at the lower data rates, which might seem attractive for data collection.
However, this type of codec not only adds quantization noise but also discards large portions of the original signal in the hope of recreating what was discarded from just a few parameters. In short, when the mid and upper spectral content is recreated it is only intended to resemble the original content: it aims to deliver a similar “noise-like” or “harmonic-like” impression for the listener. Another name for this type of codec is “non-waveform preserving”, that is to say that after decoding, the waveform and indeed the spectral content will have radically changed. Again, these codecs are smart: the decoded output will sound acceptable to us, but our machine learning algorithms will see the difference. The artifacts produced will have a significant impact on the features we extract for machine learning.
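To get a feel for the two effects described in this section (something lost, something added), here is a toy NumPy sketch. It is only an analogy under simplifying assumptions. Real codecs quantize spectral coefficients with perceptual noise shaping rather than rounding time-domain samples, and they do not use a brick-wall filter. The function names and the test signal are mine.

```python
import numpy as np

def quantize(signal, bits):
    """Uniformly quantize a signal in [-1, 1] to the given number of bits."""
    levels = 2 ** (bits - 1)
    return np.round(signal * levels) / levels

def snr_db(reference, degraded):
    """Signal-to-noise ratio of the degraded signal relative to the reference."""
    noise = degraded - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def band_limit(signal, sample_rate, cutoff_hz):
    """Crude low-pass: zero every FFT bin above cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sr = 16000
t = np.arange(sr) / sr
# A low "voice-like" component plus some high-frequency "brightness".
clean = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 5000 * t)

# Something is added: lower resolution means more quantization noise.
for bits in (16, 8, 4):
    print(f"{bits:2d} bits -> SNR {snr_db(clean, quantize(clean, bits)):.1f} dB")

# Something is lost: everything above 4 kHz is discarded.
dull = band_limit(clean, sr, 4000)
print("energy kept after band-limiting:", round(float(np.sum(dull ** 2) / np.sum(clean ** 2)), 2))
```

Any feature computed on the degraded signals sees both the extra noise and the missing upper band, even when a listener would barely notice.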
3 ways to qualify your classification
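When testing your model with coded audio there are 3 ways to improve the quality of your classification:
1. Try to recover the audio data that the codec discarded
2. Select features that are robust to the codec
3. Augment your training data with coded audio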
Please select now…
If you selected option one then… I wish you good luck. Companies in the digital audio codec space have spent years developing these codecs in order to achieve the best quality. That is to say, they have already done their best to get as much of the compressed data back as possible, or at least to make it sound indistinguishable from the original audio (and this is part of our problem).
If you selected option two then… you are on the right track. Depending on the task, you can build an intuition for the features you should and should not select, that is, if you know your data.
For example, we know the pitch of the voice is an important feature for male voice vs female voice classification. So it would be a good idea to make sure your pitch estimator is robust to the codec(s) you are working with. We also know that at the lower quality levels the codecs cut off part of the bandwidth of the signal, making it sound dull. So a bad feature to select would be any feature related to the energy or shape of the upper frequencies.
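One way to turn this intuition into a check is to estimate how much bandwidth actually survives in your decoded audio and drop any band-based features above it. The sketch below is a rough illustration with NumPy. The 99% energy criterion, the band edges and the function names are my assumptions, and the “coded” signal is simulated with a brick-wall low-pass rather than a real codec.

```python
import numpy as np

def effective_bandwidth_hz(signal, sample_rate, energy_fraction=0.99):
    """Frequency below which energy_fraction of the spectral energy lies."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    cumulative = np.cumsum(power)
    if cumulative[-1] == 0.0:
        return 0.0
    idx = int(np.searchsorted(cumulative, energy_fraction * cumulative[-1]))
    return float(freqs[min(idx, len(freqs) - 1)])

def usable_bands(band_edges_hz, bandwidth_hz):
    """Keep only the (low, high) feature bands that lie entirely below the bandwidth."""
    return [(lo, hi) for lo, hi in band_edges_hz if hi <= bandwidth_hz]

sr = 16000
rng = np.random.default_rng(0)
full_band = rng.standard_normal(sr)  # stand-in for an uncoded recording

# Simulate a low quality level that cuts everything above 4 kHz.
spectrum = np.fft.rfft(full_band)
spectrum[np.fft.rfftfreq(sr, d=1.0 / sr) > 4000] = 0.0
low_quality = np.fft.irfft(spectrum, n=sr)

bands = [(0, 1000), (1000, 2000), (2000, 3500), (3500, 8000)]  # hypothetical feature bands
bw_full = effective_bandwidth_hz(full_band, sr)
bw_low = effective_bandwidth_hz(low_quality, sr)
print("uncoded bandwidth:", round(bw_full), "Hz ->", usable_bands(bands, bw_full))
print("coded bandwidth:  ", round(bw_low), "Hz ->", usable_bands(bands, bw_low))
```

Running the same estimate over a sample of your real decoded files tells you where the codec’s cutoff sits for each quality level, and therefore which upper-frequency features to leave out.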
Lastly, if you selected option three then well done, and if you selected two and three then super well done, as the combination of the two is likely to help your model generalize to unseen data. Data augmentation is a well-known practice in machine learning. We can simply treat the coded version of our audio data as an augmentation of the data, just as we would when adding traffic noise or simulating the echoes of a room on clean speech. For example, augmenting your training data with just a few different Opus quality levels can improve the classification of Opus-coded test samples. Using the openSMILE baseline feature sets, early tests have shown that adding mid and high quality Opus coded versions to our training set significantly improves the classification accuracy for (unseen) low quality Opus coded versions. This means a model augmented with coded content will generalize to other unseen coded content (from the same codec).
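Here is one possible way to produce such coded versions: encode each clean file with Opus at a few bit rates and decode it back to WAV so that your usual feature extraction can read it. This is a sketch, assuming an FFmpeg build with the libopus encoder is installed. The bit rates, file naming and function names are illustrative choices of mine, not the settings used in the tests mentioned above.

```python
import subprocess
from pathlib import Path

def opus_roundtrip_commands(wav_path, out_dir, bitrates_kbps=(16, 32, 64), sample_rate=16000):
    """Build (encode_cmd, decode_cmd, decoded_path) for each Opus bit rate."""
    wav_path, out_dir = Path(wav_path), Path(out_dir)
    jobs = []
    for kbps in bitrates_kbps:
        coded = out_dir / f"{wav_path.stem}_opus{kbps}k.opus"
        decoded = out_dir / f"{wav_path.stem}_opus{kbps}k.wav"
        # Encode the clean WAV with libopus at the target bit rate.
        encode = ["ffmpeg", "-y", "-i", str(wav_path),
                  "-c:a", "libopus", "-b:a", f"{kbps}k", str(coded)]
        # Decode back to 16-bit PCM at the sample rate the feature extractor expects.
        decode = ["ffmpeg", "-y", "-i", str(coded),
                  "-ar", str(sample_rate), "-c:a", "pcm_s16le", str(decoded)]
        jobs.append((encode, decode, decoded))
    return jobs

def augment_with_opus(wav_path, out_dir, **kwargs):
    """Run the encode/decode round trip and return the decoded WAV paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    decoded_paths = []
    for encode, decode, decoded in opus_roundtrip_commands(wav_path, out_dir, **kwargs):
        subprocess.run(encode, check=True, capture_output=True)
        subprocess.run(decode, check=True, capture_output=True)
        decoded_paths.append(decoded)
    return decoded_paths

# Only print the commands here, so the example runs without FFmpeg or audio files.
for encode, decode, _ in opus_roundtrip_commands("speech.wav", "augmented"):
    print(" ".join(encode))
    print(" ".join(decode))
```

The decoded files keep the label of the clean original and are simply added to the training set alongside it. To reproduce the kind of test described above, you would hold out one quality level entirely and use it only for evaluation.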
In summary, you can feel at ease when using coded content in your machine learning algorithms if you use a high-quality, non-parametric codec and make sure to augment your training data.