Audio Classification with Machine Learning

At EuroPython 2019 in Basel I gave an introduction to the use of modern machine learning for audio classification. It was very well received, both at the event and in the video recording afterwards. The full video recording can be found here (YouTube).

Below I provide a transcript of the presentation, for those who prefer to read instead of looking at a video.

Introduction

[chair]
Welcome to the session.
Jon Nordby is a maker and full-stack IoT developer.
He successfully defended his master’s thesis in Data Science two weeks ago and he’s now embarked on an IoT startup called Soundsensing.
Today he’ll talk to us about a topic related to his thesis: Audio classification with machine learning.

[presenter]
Hi, thank you.
So, audio classification is not as popular a topic as, for instance, image classification or natural language processing.
So I'm happy to see that there are still people in the room who are interested in this topic.

About me

So first a little about me. I’m an Internet of Things specialist.
I have a background in electronics from nine years ago.
I have worked a lot as a software engineer, because electronics is mostly software these days.
And then I went to do a Masters in Data Science because IoT to me is the combination of
electronics (sensors especially),
software (you need to process the data),
and data itself – transforming sensor data into information that is useful.
Now I’m consulting on IoT and machine learning and I’m also CTO of Soundsensing.
We deliver sensor units for noise monitoring.

About the talk

In this talk my goal (we'll see if we get there) is that you, as a machine learning practitioner without necessarily any prior experience in sound processing, can solve a basic audio classification problem.
We'll have a very brief introduction to digital sound.
And then we’ll go through a basic audio classification pipeline.
And then some tips and tricks for how to go a little bit beyond that basic pipeline.
And then I’ll give some pointers to more information.
The slides and a lot of my notes on machine hearing, in general a bit broader than audio classification, are on this GitHub repository:
https://github.com/jonnor/machinehearing/

Applications

There are some very well-recognized subfields of audio; speech recognition is one of them.
There, for instance, you have keyword spotting as a classification task: "Hey Siri" or "OK Google".
And in music analysis you also have many tasks; genre classification, for instance, can be seen as a simple audio classification task.
We're going to keep it mostly at the general level: we're not going to use a lot of speech- or music-specific domain knowledge.
We still have examples across a wide range of things – anything you can do with hearing as a human, we can get close to with machines today (on many tasks at least).

In ecoacoustics you might want to analyze bird migrations using sensor data to see their patterns.
Or you might want to detect poachers in protected areas to make sure that no one actually is going around shooting where there should be no gunshots.
It's used in quality control in manufacturing, especially because you don't have to go into the equipment or the product under test – you can listen to it from the outside.
For instance it is used for testing electrical car seats, checking that all the motors run correctly.
In security it's used to help monitor large numbers of CCTV cameras by also analyzing audio.
And in medical for instance you could detect heart murmurs which could be indicative of a heart condition.

Background

Digital sound

The first important thing is that sound is almost always, basically always, a mixture.
Because sound will travel around a corner, unlike light for instance.
And you will always have sound coming from multiple places.
It will also travel through the ground and be reflected by walls.
And all these things mean that you always have multiple sound sources:
The source of interest and then always other sound sources (noise).

Audio Acquisition

Physically, sound is a variation in air pressure.
It goes through a microphone and is converted to an electrical voltage,
then through an Analog-to-Digital Converter (ADC), and then we have a digital waveform – which is what we will deal with.
As digital audio it is quantized in time (the sampling rate) and in amplitude.
We usually deal primarily with mono, one channel, when we do audio classification.
There are some methods for stereo and more channels, but they are not widely adopted.
We typically use uncompressed formats, as that is just the safest.
Although in real-life situations you might also have compressed data,
which can have artifacts and so on that might influence your model.

Spectrograms


So after we have a waveform we can convert it into a spectrogram.
And this is in practice a very useful representation,
both for a human trying to understand what a sound is, and for machines.

To use one example, this is a frog croaking: r / r / r.
Very periodic, with a little gap.
And then, it's hard to see, but at the top, at higher frequencies, there are some cicadas going as well.
This allows us to see both the frequency content and the patterns across time.
Together this often allows you to separate the different sound sources in your mixture.

Environmental Sound Classification

So we'll go through a practical example just to keep it hands-on: Environmental Sound Classification.
We are given a signal of environmental sounds – everyday sounds that are present in the environment.
For instance outdoors: cars, children and so on.
It's very widely researched, and we have several open data sets that are quite good.
AudioSet has, I think, tens of thousands or even hundreds of thousands of samples.
And in 2017 we reached roughly human-level performance (only one of these data sets, ESC-50, has an estimate of human-level performance), but we seem to have surpassed it.

Urbansound8k dataset


One nice data set is Urbansound8k, which has ten classes and around 8000 samples.
The samples are roughly 4 seconds long, about 9 hours in total, and state of the art here is around 80 percent accuracy (79 to 82).
Here we have some spectrograms, and these are the easy samples.
This data set has many challenging samples where the sound of interest is very far away and hard to detect; the ones shown here are easy.
You can see the siren going up and down, while jackhammers and drilling have very periodic patterns.
So how can we detect this using a machine learning algorithm in order to output these classes?

A simple Audio Classification pipeline


I will go through a basic audio pipeline, skipping around 30 to 100 years of history of audio processing and going straight to what is now the typical way of doing things.
And it looks something like this.

Overview

So first, as input, we have our audio stream.
It's important to realize that audio has a time dimension, so it's more like a video than an image.
And in a practical scenario you might do real-time classification, so this might be an infinite stream that just goes on and on.
For that reason, and also for the machine learning, it's important to divide the stream into relatively small analysis windows that you will actually process.

However, you often have a mismatch between how often you have labels for your data and how often you actually want a prediction.
This is known as weak labeling.
I won't go much into it, but in Urbansound8k the labels are per 4-second sound snippet, so that is what we are given in this curated data set.
However it’s usually beneficial to use smaller analysis windows to reduce the dimensionality of the machine learning algorithm.

So the process is that we divide the audio into these segments, often using overlap.
Then we convert each segment into a spectrogram, and we use a particular type of spectrogram called a mel-spectrogram, which has been shown to work well.
And then we’ll pass that frame or features from the spectrogram into our classifier and it will output the classification for that small time window.
And then because we have labels per 4 seconds we’ll need to do an aggregation in order to come up with the final prediction for these four seconds, not just for this one little window.

Yeah we’ll go through these steps now.

Overlap of analysis windows


First, for the analysis windows, I mentioned that we often use overlap.
Overlap can be specified in different ways; one is an overlap percentage.
Here we have 50% overlap, which means that we're essentially classifying every piece of the audio stream twice.
We could have even more overlap, maybe 90 percent, and then we're classifying each piece 10 times.
That gives the algorithm multiple viewpoints on the audio stream, and makes it easier to catch the sounds of interest.
Because in training, the model might have learned to prefer a certain sound appearing in a certain position inside the analysis window.
Overlap is a good way of working around that.
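As a rough sketch, splitting a precomputed mel-spectrogram into overlapping analysis windows could look something like this in NumPy; the split_windows helper and the window length are just illustrative:

import numpy as np

def split_windows(spectrogram, frames_per_window, overlap=0.5):
    # Split a (mel_bands, time_frames) spectrogram into fixed-length
    # analysis windows with the given fractional overlap
    hop = max(1, int(frames_per_window * (1.0 - overlap)))
    windows = []
    start = 0
    while start + frames_per_window <= spectrogram.shape[1]:
        windows.append(spectrogram[:, start:start + frames_per_window])
        start += hop
    return np.stack(windows)

With 50% overlap every frame is seen twice; with 90% overlap roughly ten times.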

Mel-spectrograms preprocessing


As I mentioned, we use a specific type of spectrogram.
The spectrogram is usually processed with so-called mel-scale filters, which are inspired by human hearing.
Our ability to differentiate sounds of different frequencies is reduced as the frequencies get higher.
For low sounds we are able to detect small frequency variations, while for high-pitched sounds we need large frequency variations.
By using these kinds of filters on the spectrogram we obtain a representation that is more similar to how our ears work.
But more importantly it's a smaller representation for a machine learning model, and related data in consecutive frequency bins gets merged.

When you have done that, it looks something like this.
At the top is a normal spectrogram, and you see a lot of small details.
In the second one we have started the mel filters at 1000 Hz.
This is bird audio, which is quite high pitched, with a lot of chirps going up and down.
And in the third one we have normalized it.
We usually use log-scale compression because sound has a very large dynamic range.
Sounds that are faint versus sounds that are very loud to the human ear can differ by a factor of 1000 or 10,000 in energy.
So when you have applied the mel filters, log-scaled and normalized the spectrogram, it looks something like the image below.

This picture shows the processing in Python code.
I'm not going to go through all of it in detail; we have an excellent library called librosa which is great for loading the data and doing basic feature pre-processing.
Some of the deep learning frameworks also have their own mel-spectrogram implementations that you can use.
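A minimal sketch of that librosa-based pre-processing might look like the following; the file name and the parameter values (sample rate, number of mel bands, FFT and hop sizes) are just examples:

import numpy as np
import librosa

# Load audio as mono; librosa resamples to the requested sample rate
y, sr = librosa.load('dog_bark.wav', sr=22050)  # hypothetical file

# Mel-spectrogram with 64 mel bands (settings vary per application)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64,
                                   n_fft=1024, hop_length=512)

# Log-scale compression, since sound has a very large dynamic range
log_S = librosa.power_to_db(S, ref=np.max)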

Normalizing across sample or window

But there is one general issue with streaming.
When people analyze audio they often apply normalization computed across the whole sample (four seconds in this case), or across their whole data set.
That can be hard to apply when you have a continuous audio stream, which may for instance have changing volume.
So what we usually do is to normalize per analysis window, the hope being that roughly one second of data contains enough information to do this normalization.
Doing normalization like this has some interesting consequences when there is no input.
Because if you have no input to your system, you will probably just amplify the noise.
So you sometimes need to exclude very low-energy signals from being classified.
Just a little practical tip.
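A simple per-window normalization scheme could look like the sketch below; the statistics used and the silence threshold are assumptions, not a fixed recipe, and the threshold assumes dB values relative to a fixed reference:

import numpy as np

def normalize_window(log_mel, silence_threshold_db=-60.0):
    # Normalize one analysis window using only its own statistics.
    # Near-silent windows are skipped, since normalizing them would
    # mostly amplify noise.
    if log_mel.max() < silence_threshold_db:
        return None  # treat as silence / skip classification
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-9)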

Convolutional Neural Network

So Convolutional Neural Networks, they’re hot.
Who here has basic familiarity, having at least gone through a tutorial or read a blog post about image classification and CNNs?
Yeah that’s quite a few.

So CNNs are the best in class for image classification.
Spectrograms are an image-like audio representation, with some differences, so the question is (or was): will CNNs work well on spectrograms?
Because that would be interesting, and the answer is… yes.
This has been researched quite a lot, and it is great because there are a lot of tools, knowledge, experience and pre-trained models for image classification.
So being able to reuse those in the audio domain, which is not such a big field, is a major win.
A lot of the recent research in audio classification can be a little bit boring, because much of it is taking last year's image classification tools, applying them to audio, and seeing whether they work.

It is however a little bit surprising that this actually works because the spectrogram has frequency on the y-axis (typically it’s shown that way) and time on the other axis.
So a movement or a scaling in this space doesn’t mean the same as in an image.
In an image, it doesn't matter where my face appears – it's still my face.
But if you have a spectrogram with a certain sound, like a chirp going up and down, and you move it up or down in frequency – at least if you move it a lot – it's probably not the same sound anymore.
It might go from a human talking to a bird; the shape might be similar, but the position matters.
So it’s a little bit surprising that this works, but it does seem to do really well in practice.

SB-CNN


So this is one model that does well on Urbansound.
One thing you'll note, compared to a lot of image models, is that it's quite simple, with relatively few layers. It is smaller than, or about the same size as, LeNet-5.
There are three convolutional blocks, with max pooling between the first two blocks.
And that's, you know, the standard kind of CNN architecture.
This one uses 5x5 kernels instead of 3x3; that doesn't make much of a difference, you could stack another layer and achieve the same thing.
Then we flatten and use a fully connected layer.
It is from 2016 and is still close to state-of-the-art on this dataset, Urbansound8k.

If you are training a CNN from scratch on audio data, start with a simple model.

There is usually no reason to start with, say, VGG16 with 16 layers and millions of parameters,
or even MobileNet or something like that.
You can usually go quite far with this kind of simple architecture – a couple of convolutional layers.

In Keras, for example, this could look something like this.
We have our individual blocks: convolution, max pooling, ReLU non-linearity.
The same for the second one, and then our classification at the end with the fully connected layers.
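As a sketch, a small SB-CNN-style network in Keras might look roughly like the following; the filter counts and input dimensions are assumptions for illustration, not the exact published model:

from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(bands=60, frames=41, n_classes=10):
    # Three 5x5 convolution blocks with max pooling between the first two,
    # followed by a small fully connected classifier
    return keras.Sequential([
        layers.Conv2D(24, (5, 5), padding='same', activation='relu',
                      input_shape=(bands, frames, 1)),
        layers.MaxPooling2D((4, 2)),
        layers.Conv2D(48, (5, 5), padding='same', activation='relu'),
        layers.MaxPooling2D((4, 2)),
        layers.Conv2D(48, (5, 5), padding='same', activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation='softmax'),
    ])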

So this is our classifier.
We will pass the spectrogram analysis window through this,
and it will give us a prediction for which class it was among the classes in Urbansound.

Aggregating analysis windows

And then we need to aggregate these individual windows, and there are multiple ways of doing this.
The simplest thing to think about is majority voting.
If we have ten windows over the four-second clip, we make the predictions,
and then just say that the majority (most common) class wins. That works rather well.
But it's not differentiable, so it is done as post-processing.
And you're making a very coarse decision at each step.

Mean pooling, or global average pooling, across those analysis windows usually does a little bit better.
What's nice with deep learning frameworks is that you can have this as a layer; for instance in Keras there is the TimeDistributed layer.
There are sadly very few examples of it online.
It's not that hard to use, but it took me a while to figure out how to do it.

So we apply a base model, which is in this case the input to this function.
We pass it to the TimeDistributed layer, which essentially uses a single instance of your model, so the weights are shared across all the analysis windows.
It then just runs the model multiple times during the prediction step, and we do global average pooling over these predictions.
Here we're averaging predictions; you can also do more advanced things, for instance averaging a feature representation and then running a more advanced classifier on top of that.
This mean pooling of predictions is quite often called probabilistic voting in the literature.
This gives us a new model which takes not a single analysis window, but a set of analysis windows – typically corresponding to our 4 seconds, for example 10 windows.
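A sketch of this wrapping in Keras could look like the following, assuming base_model is the single-window classifier from before and the shapes match your spectrogram windows:

from tensorflow import keras
from tensorflow.keras import layers

def build_windowed_model(base_model, n_windows=10, bands=60, frames=41):
    # Takes a stack of analysis windows instead of a single window.
    # TimeDistributed shares one set of weights (the base model) across
    # all windows; the per-window predictions are then mean-pooled.
    windows = keras.Input(shape=(n_windows, bands, frames, 1))
    per_window = layers.TimeDistributed(base_model)(windows)
    averaged = layers.GlobalAveragePooling1D()(per_window)
    return keras.Model(windows, averaged)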

Demo

So if you do this, and a couple more tricks from my thesis, you can have a system working like in this demo.
In addition to building the model, which I've gone through,
we're also deploying it to a small microcontroller using vendor-provided tools that convert a Keras model.
That is fairly standard, so I won't go into it here.

So little demo video (if we have sound).

Here is Children playing.

What we basically also do here is threshold the prediction: if no prediction is confident enough, we consider it unknown.
And this is also important in practice because sometimes you have out of class data.
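Such confidence thresholding can be as simple as the sketch below; the threshold value is just an example and should be tuned on validation data:

import numpy as np

def predict_with_unknown(probabilities, threshold=0.6):
    # Return the most confident class index, or 'unknown' if no class
    # reaches the confidence threshold
    best = int(np.argmax(probabilities))
    if probabilities[best] < threshold:
        return 'unknown'
    return best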

This is Drilling.
Or actually, the sample I found said jackhammer; jackhammer is another class.
They are honestly hard to distinguish by ear sometimes, and the model can also struggle with it.

There is Dog Barking.

And so in this case all the classification happens on the small sensor unit which is what I focused on in my thesis.

Siren
The model actually didn't get the first part of the siren, only the undulating part later.
These samples are actually not from the Urbansound dataset which I trained on.
They are "out of domain" samples, which is generally much more challenging.

Yes that’s it for demo.
If you want to know more about doing sound classification on these sensor units, you can get my full thesis.
Both the report and the code are there.

Tips & Tricks

Some tips and tricks.
We've covered the basic, modern audio processing pipeline, and that will generally give you quite good results with a modern CNN architecture.
There are some additional tips and tricks, especially useful in practice when you are tackling a new problem:
you're not researching an existing data set, your data sets are usually much smaller, and it's quite costly and tedious to annotate all the data.
And so here are some tricks for that.

Data Augmentation


First one is data augmentation.
This is well known from other deep learning applications, especially image processing.
Augmentation of audio can be done either in the time domain or in the spectrogram domain, and in practice both seem to work fine.
Here are some examples of common augmentations.
The most common, and possibly most important, is time shifting.
Remember that when you classify an analysis window of maybe one second, the sound of interest is somewhere in there.
And what the individual convolution kernel sees might be very short.
Bird chirps, for instance, are maybe 100 milliseconds at most, maybe even 10 milliseconds, so they occupy very little space in the image that the classifier sees.
But it's important that the model is able to classify it no matter where inside the analysis window it appears.
Time shifting simply means that you shift your samples in time, forward and backward (left and right).
That way the model will have seen that bird chirps can appear at any place in time, and that it doesn't make a difference to the classification.
So this is by far the most important one and you can usually go quite far with just time shifting.

If you do want the precise location of your event – you want a classifier that can tell when the chirps appeared,
in the 100-millisecond range, instead of just that there were birds somewhere in this 4-10 seconds of audio – then you might not want to do time shifting.
Because you might want the output to be based on the sound always occurring in the middle of the window.
But then your labeling needs to respect that.

Time stretching.
For many sounds, if I speak slowly or I speak very fast, it's the same meaning, and certainly the same class – it's both speech.
So time stretching is also very effective for capturing such variations.

And also pitch shifting. If I'm speaking with a low voice or a high-pitched voice, it's still the same kind of information.
The same carries over to general sounds, at least a little bit: a little bit of pitch shift you can accept, but a big shift might bring you into a new class.
For instance, the difference between human speech and bird chirps might be a big pitch shift.
So you might want to limit how much you pitch shift.
Typical augmentation settings here are maybe 10 to 20 percent for time stretching and pitch shifting.
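A sketch of such waveform-domain augmentation with NumPy and librosa might look like this; the ranges are only indicative:

import numpy as np
import librosa

def augment(y, sr):
    # Time shift: roll the waveform left/right by up to 20% of its length
    shift = np.random.randint(-int(0.2 * len(y)), int(0.2 * len(y)))
    y = np.roll(y, shift)
    # Time stretch: 10% slower to 10% faster (changes the length, so you
    # may need to pad or crop back to a fixed duration afterwards)
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    # Pitch shift: up to +/- 2 semitones
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    return y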

You can also add noise.
This is also quite effective, especially if you know that you have variable amounts of noise.
Random noise works okay; you can also sample real noise.
There are lots of repositories of basically just noise, so you mix noise in with your signal and classify that.

Mixup is an interesting data augmentation strategy that mixes two samples by taking a linear combination of them and adjusting the labels accordingly.
It has been shown to work really well on audio, also in combination with other augmentation techniques.
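A minimal mixup sketch, assuming one-hot encoded labels, could look like this:

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Linear combination of two samples and of their (one-hot) labels;
    # alpha controls how strong the mixing typically is
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y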

Transfer learning

So yes, we can basically apply CNNs to audio with standard image-style architectures. This means that we can also do transfer learning from image data.
Of course, image data is significantly different from spectrograms.
I mentioned the frequency axis and so on.
However, some of the base operations needed are shared: you need to detect edges, diagonals,
patterns of edges and diagonals, and blob-like areas.
Those are common pieces of functionality needed by both.

So if you do want to use a bigger model, definitely try to use the pre-trained model.
For instance most frameworks including Keras have pretrained models on ImageNet.
The thing is that most of these models take RGB color images as input.
It can work to just use one of those channels and zero-fill the other ones.
But you can also just copy the data across the three channels.
There are also some papers showing that you can do multi-scale inputs.
For instance one channel has a spectrogram with very fine time resolution and another has one with very coarse time resolution. This can be beneficial.
But because image data and sound data are quite different, you usually do need to fine-tune; it's usually not enough to just apply a pre-trained model and only tune the classifier at the end.
You need to fine-tune a couple of layers at the end, and typically also at least the first layer. Sometimes you fine-tune the whole thing, but that is generally not very beneficial.
So definitely, if you have a smaller data set and you need high performance that you can't get with a small model – go with a pre-trained model, for instance MobileNet or something like that.
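A sketch of such transfer learning in Keras might look like the following; the input size, the choice of MobileNet and the fine-tuning schedule are assumptions for illustration:

from tensorflow import keras
from tensorflow.keras import layers

# ImageNet-pretrained MobileNet as a feature extractor for mel-spectrograms
base = keras.applications.MobileNet(include_top=False, weights='imagenet',
                                    input_shape=(128, 128, 3), pooling='avg')

inputs = keras.Input(shape=(128, 128, 1))
# Pretrained weights expect 3 channels: copy the spectrogram across them
x = layers.Concatenate()([inputs, inputs, inputs])
x = base(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = keras.Model(inputs, outputs)

# A common schedule: train only the new classifier first, then unfreeze
# some of the base layers and continue with a low learning rate
base.trainable = False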

Audio Embeddings

Audio embeddings are another strategy.
Inspired by text embeddings, where you create for instance a 128-dimensional vector from your text data, you can do the same with sound.
With "Look, Listen and Learn" (L3) you can convert a one-second audio spectrogram into a 512-dimensional vector. The model has been trained on millions of YouTube videos, so it has seen a very large range of different sounds, and it uses a CNN under the hood. It basically gives you a very compressed vector to classify.
I didn't finish a code sample here, but there is a very nice recent implementation called OpenL3.
"Look, Listen and Learn More" is the paper, and they have a Python package which makes it super simple.
Just import it, call one function to pre-process, and then you can classify the audio with basically just a linear classifier from scikit-learn.

So if you don't have any deep learning experience and you want to try an audio classification problem, definitely go this route first.
It will basically handle the audio part for you, and you can apply a simple classifier after that.
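A minimal sketch with the openl3 package and scikit-learn could look like this; the file name is hypothetical, and you should check the openl3 documentation for the exact parameters in your version:

import openl3
import soundfile as sf
from sklearn.linear_model import LogisticRegression

# One embedding vector per (roughly) one-second window of audio
audio, sr = sf.read('example.wav')
embeddings, timestamps = openl3.get_audio_embedding(audio, sr,
                                                    content_type='env',
                                                    embedding_size=512)

# Any simple scikit-learn classifier can then be trained on top of the
# embeddings, given embedding/label arrays you have prepared
clf = LogisticRegression(max_iter=1000)
# clf.fit(X_train, y_train)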

Annotating datasets

You might want to build your own dataset.
Audacity is a nice editor for audio, and it has good support for annotating by adding a label track.
There are keyboard shortcuts for all the functions that you need, so it's quite quick to use.
Here I'm annotating some custom audio where we did event recognition,
and the nice thing is that the export format is basically a CSV file.
It has no header and so on, but a Pandas one-liner will give you a nice data frame with all your annotations for the sound.
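For example, assuming the labels were exported from Audacity to a file called labels.txt (tab-separated, no header), the one-liner could be:

import pandas

# Audacity label tracks export as tab-separated text with no header:
# start time, end time, label
annotations = pandas.read_csv('labels.txt', sep='\t', header=None,
                              names=['start', 'end', 'annotation'])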

Summary

Time to summarize. We went through the basic audio pipeline.

  • We split the audio into fixed length analysis windows.
  • We used log mel-spectrogram as a sound representation because it’s shown to work very well.
  • We then applied a machine learning model, typically a Convolutional Neural Network.
  • And then we aggregated the predictions from each individual window and we merged them together using global mean pooling.

For models, if you are trying some new data, I would recommend:
try audio embeddings with OpenL3 and a simple model, like a linear classifier or a random forest, for instance.
Or try a convolutional neural network using transfer learning.
That is quite powerful, and there are usually examples that will get you pretty far.
If you preprocess the spectrograms and save them as PNG files, you can basically take any image classification Python code that you already have –
at least if you're willing to ignore the merging of the different analysis windows – and use that.

Data augmentation is very effective. Time-shift, time-stretch, pitch-shift and noise-add are all basically recommended.
Sadly there are no nice go-to implementations of these as, for instance, Keras data generators, but it's not that hard to do.

Some more learning resources for you.
The slides and also a lot of my notes in general are on this Github.
If you want hands-on experience, TensorFlow has a pretty nice tutorial called Simple Audio Recognition, about recognizing speech commands.
That could be an interesting application in itself, but the tutorial takes a general approach – it's not speech specific – so you can use it for other things.
Also there’s one book recommendation: Computational Analysis of Sound Scenes and Events.
It’s quite thorough when it comes to general audio classification. A very modern book from 2018.

Questions and Answers

Q.

Hey yeah thanks Jon, a very interesting application of machine learning. I have two questions.
There's obviously a time-series component to your data. I'm not so familiar with this audio classification problem, but can you tell us a bit about time-series methods, maybe LSTMs and so on, and how successful they are?

A.

Yes, time series – intuitively one would really want to apply that, because there is definitely a time component.
Convolutional Recurrent Neural Networks do quite well when you're looking at longer time scales.
For instance, there's a classification task called audio scene recognition:
given a 10 or maybe 30-second clip, is this from a restaurant, from a city street, and so on.
There you see that recurrent networks, which have much stronger time modelling, do better.
But for short tasks, CNNs do just fine, surprisingly well.

Q.

And the other small question I had was just to understand your label, the target that it’s learning.
You said that the sound is always a very mixed signal, so are the labels just one category of sound when you're learning,
or would it be beneficial to have maybe a weighted set of two categories when learning?

A.

Yes, so in ordinary classification tasks the typical style, kind of by definition, is to have a single label for some window of time.
You can of course have multi-label datasets, and in practice that's a more realistic modelling of the world, because you basically always have multiple sounds.
I think AudioSet has multi-labelling, and there's a new Urbansound data set now that also has multiple labels.
And then you apply more tagging-style approaches.
But you're still using classification as the basis for tagging:
you can either use a separate classifier per sound of interest,
or you can have a joint model which has multi-label classification.
So definitely this is something that you would want to do but it does complicate the problem.

Q.

You mentioned, about data augmentation, that we can apply mixup to separate classes and mix them. Should the label of that mix be weighted as well – it kind of connects with the previous question – usually like 0.5 for one and 0.5 for the other?

A.

Yes, so mixup was proposed something like 2-3 years ago; it's a general method.
So you basically take your sound with your target class and you say okay let’s take 80% of that (not 100%), and then take 20% of some other sound which is a non-target class, mix it together and then update the labels accordingly.
So it’s kind of just telling you hey there is this predominant sound, but there’s also this sound in the background.

Q.

Yes, you mentioned the main frequency ranges, but usually when you record audio with microphones you get up to 20,000 Hertz.
Do you have any experience with, or could you comment on, whether the added information in the higher frequency ranges affects the machine learning algorithm?

A.

So typically recordings are done at 44 kilohertz or 48 kilohertz for general audio.
Often machine learning is applied at a lower sample rate, so 22 kilohertz or so, sometimes just 16, and in rare cases even 8.
It depends on the sounds of interest: if you're doing birds you definitely want to keep those high frequencies.
If you're doing speech you can do just fine at 8 kilohertz.
Another thing is that noise tends to be in the lower areas of the spectrum; there's more energy in the lower end.
So if you are doing birds you might want to just ignore everything below 1 kilohertz, for instance, and that definitely simplifies your model, especially if you have a small data set.
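For example, with librosa you can restrict the mel filterbank to the band of interest; the file name and values here are only illustrative:

import librosa

# Keep only the band of interest, e.g. above 1 kHz when analyzing birds
y, sr = librosa.load('bird_recording.wav', sr=44100)  # hypothetical file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64,
                                   fmin=1000, fmax=sr / 2)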

Q.

Quick question you mentioned the editor that has support for annotating audio could you please repeat the name?

A.

Yes, Audacity.

Q.

And my general question: do you have any tips if, for example, you don't have an existing dataset and are just starting with a bunch of audio that you want to annotate first?
Do you have any advice for strategies like some maybe semi-supervised?

A.

Yes, semi-supervised learning is very interesting. There are a lot of papers, but I haven't seen a very good practical methodology for it.
I think annotating a data set is in general a whole other talk, but I'm very interested in chatting about this later.

Q.

Thank you, very nice talk.
My question would be: do you have to deal with any pre-processing, like filtering to remove white noise,
or is it just, as you said, removing or ignoring certain frequency ranges?

A.

Yes, you can. Scoping your frequency range is definitely very easy, so just do it if you know where your sounds of interest are.
Denoising. You can apply a separate denoising step beforehand and then do machine learning.
If you don’t have a lot of data, that can be very beneficial.
For instance, maybe you can use a standard denoising algorithm trained on like thousands of hours of stuff, or just a general DSP method.
If you have a lot of data then in practice the machine learning algorithm itself learns to suppress the noise.
But it only works if you have a lot of data.

Q.

So thank you for the talk.
Is it possible to train a deep convolutional neural net directly on the time-domain data, using 1D convolutions and dilated convolutions?

A.

Yes this is possible, and it is very actively researched.
But it's only within the last year or two that they have been getting to the same level of performance as spectrogram-based models.
Some models are now actually showing better performance with end-to-end training.
So I expect that in a couple of years that may become the go-to approach for practical applications.

Q.

Can I do speech recognition with this? This was only like six classes, and I think you have many more classes in speech?

A.

Yes, if you want to do Automatic Speech Recognition – the complete vocabulary of English, for instance – then theoretically you can.
But there are models specific to automatic speech recognition that will in general do better.
So if you want full speech recognition you should look at speech specific methods and there are many available.

But if you're doing a simple task, like commands – yes/no, up/down, one, two, three, four, five – where you can limit your vocabulary to, say, under a hundred classes, then it gets much more realistic to apply a speech-unaware model like this.

Q.

Thanks for an interesting presentation.
I was just wondering from the thesis, it looks like you applied this model to a microprocessor.
Can you tell us a little bit about the framework you used to transfer it from Python?

A.

Yes, so we use the vendor provided library from ST microelectronics for the STM32 and it’s called X-CUBE-AI.
You’ll find links in the Github.
It’s a proprietary solution, only works on that microcontroller, but it’s very simple: you throw in the Keras model, it will give you a C model out.
They have code examples for the audio pre-processing (with some bugs, but it does work).
The firmware code for the thesis is also in the GitHub repository, not just the model, so you can basically download that and get going.

Host-based simulation for embedded systems

Embedded systems and microcontroller programs can be really hard to understand. Here are some techniques that can help.

An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system …
– Wikipedia

Examples of embedded systems include everything from a fridge thermostat, to the airbag sensors in your car, to consumer electronics like an electronic keyboard for music. Programming such a system can be extra challenging compared to writing software for a general-purpose computer. The complicating factors may include:

  • Limited computing power, memory, storage or bandwidth
  • Need for low and deterministic response times (soft or hard real-time)
  • Running on battery, within a constrained power budget
  • Need for high reliability without user intervention over long periods of time (months,years)
  • A malfunctioning system may cause material damage or harm people
  • Problem may require considerable domain-specific knowledge

But before all that, we typically need to come to terms with a more mundane and practical problem:
that it is usually very hard to understand what is going on with our program!

Why embedded systems are harder to debug

This problem of hard-to-understand software systems is not at all unique to embedded; it happens frequently also when developing for a PC or mobile device. But several aspects typically make the problem more severe.

  • It is often hard to stimulate the system with inputs automatically, as they are typically physical/external in nature
  • Inputs, outputs and internal state of the system may change faster than humans can observe in real-time
  • The environment of the system typically influences results, sometimes in unforeseen ways
  • Low-level programming languages and techniques are still the norm
  • The target device does not have a UI or development tools
  • The connection to a system is often (or at best) a slow serial connection

To make working with the system more pleasant and productive, I have found two techniques very useful:

  • Recording sensor data from device, and then analyzing it ‘offline’ on the PC
  • Running the firmware logic on the PC, by abstracting away hardware dependencies

Case: Triggering MIDI with capacitive sensors

A friend of mine is building an electronic music instrument, a Hang-like thing with 9 pads. It uses capacitive touch sensors and sends MIDI notes over USB, for triggering a sampler or synthesizer to make sound.

Core components of system. Capacitive sensor(s) connected to Arduino, sending MIDI messages to computer software over USB.

The firmware on the device was an Arduino sketch, using the CapacitiveSensor library to read the capacitance of the pads.
Summing up N consecutive samples gives basic filtering, and a note is triggered if the value exceeds a specified threshold TH.

The setup worked in principle, and in practice for some ways of hitting the drumpad. But it seemed impossible to find a combination of N and TH where the device would trigger correctly in all cases. If N was too high, then fast taps would be dropped/ignored. If N was too low, it caused occasional double triggering. If TH was too high, it would not trigger on single finger taps. If TH was too low, it would trigger when a palm was just hovering over the pad.

How to make it work reliably?

To solve a problem, you first have to understand it

Several hours had been spent tweaking the values, recompiling the sketch, uploading and hitting the pads a bit, listening if it did the right thing. This gave us some rough intuitions about things that worked and not, but generally our understanding of the situation was quite sparse. Which is understandable; there is only so much one can learn from observing a system in real-time from the outside with the naked eye/ear.

It seemed that how the pad was hit had an influence. Which is plausible; the surface area might influence how effective the capacitive coupling is.

Example poses that should trigger. Hitting with fingertip, finger flat, 3-fingers and side of thumb.

Some cases which should *not* trigger. Hand hovering over pad, touching casing with thumb but not the pad.

What we needed to understand was: what is the input data in different scenarios?

First we added logging to the Arduino sketch, sending the time between readings and the current sensor value (for one sensor).

  const long beforeRead = millis();
  const Input input = readInputs();
  const long afterRead = millis();
  Serial.print("(");
  Serial.print(afterRead-beforeRead);
  Serial.print(",");
  Serial.print(input.values[0].capacitance);
  Serial.println(")");

The stream showed that with N=70, reading the sensors took a long time, over 40 ms. This is unacceptable for a musical instrument, so it was clear that this value *had* to go down. Any issues caused would have to be fixed in some other way.

Then we wrote a small Python script to read the serial port and write the raw data to a file.
This allowed us to record the data for a whole scenario. For instance we recorded things like ‘tapping-3-times-quickly’ or ‘hovering-then-touching’, using the filename as a description of what the scenario was, and directories to group sets of recordings.
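A minimal sketch of such a recording script using pyserial might look like this; the serial port, baud rate and file name are assumptions:

import serial  # pyserial

# Read lines from the Arduino and append them to a file named after
# the scenario being recorded
with serial.Serial('/dev/ttyACM0', 115200, timeout=1) as port, \
        open('tapping-3-times-quickly.log', 'w') as out:
    while True:
        line = port.readline().decode('ascii', errors='ignore')
        if line:
            out.write(line)
            out.flush()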

Another Python script was then used for analysing the data, parsing the raw values and using matplotlib to plot it out.
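A sketch of that analysis script, assuming the "(milliseconds, capacitance)" line format produced by the logging code above, might be:

import ast
import matplotlib.pyplot as plt

def read_log(path):
    # Parse "(milliseconds, capacitance)" tuples logged by the Arduino sketch
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('(') and line.endswith(')'):
                samples.append(ast.literal_eval(line))
    return samples

samples = read_log('tapping-3-times-quickly.log')  # hypothetical file name
capacitance = [value for (_elapsed, value) in samples]
plt.plot(capacitance)
plt.xlabel('sample number')
plt.ylabel('capacitance (raw)')
plt.show()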

Now we could finally *see* what was going on over longer periods of time, and compare different scenarios against each other.

This case illustrated the crux of our problem. The red areas indicate where we are hovering *over* the pad with palm, but the sensed capacitance values are higher than when touching with a fingertip.
If the threshold is set too high (orange line) we miss the finger tap, and if it is too low (yellow) we will false trigger on a hovering palm.

Can we do better with an alternative detection algorithm? Maybe with a high-pass filter to detect the changes at the edges it would be possible to identify both cases. Plugging a high-pass filter into the Python analysis script and playing with the values seemed to support this.

But we cannot run Python for real-time processing. We need to be able to implement the filter in the Arduino firmware.

Exponential Moving Average filters

An Exponential Moving Average (EMA or EWMA) was selected as the basis of the filter. It has many desirable properties for use in a latency-sensitive application on a microcontroller: it only requires storing one number, is computationally simple, and is robust against variation in sampling time (jitter). And unlike a FIR filter, it does not introduce latency (apart from the time constant of the filter itself). Here is a nice introduction for Arduino usage.

static int
exponentialMovingAverage(const int value, const int previous, const float alpha) {
  return (alpha*value) + ((1-alpha)*previous);
}
...
next.highfilter = exponentialMovingAverage(input.capacitance, previous.highfilter, appConfig.highpass);
next.highpassed = input.capacitance - next.highfilter;
...

What value should highpass have, and how do we know if it works correctly?

Host-based simulation

A regular Arduino sketch can generally only run on the target microcontroller. This is because the application logic is mixed with the hardware-dependent I/O libraries, in this case CapacitiveSensor and MidiUSB.
But Arduino is just C++. Nothing prevents us from separating out the application logic and making it hardware-independent so it can also execute on our host. The easiest method is to put the code into a .hpp, and then include that in our sketch and any host-only tools we have.

#include "./hangdrum.hpp"

This lets us use all the regular C++ tools and practices for testing and validating code, without needing access to the hardware. Automated unit- and integration-testing, fuzz-testing, mutation testing, dynamic analysis like Valgrind, using a continuous integration service like Travis CI. In a project with custom hardware, it lets you develop most parts of the software before the hardware is finalized, potentially saving a lot of time.

I like to express the entire application logic of the firmware as a pure function which takes Input and current State, and returns the new State. This formulation lets us know exactly what may affect the system – no hidden dependencies or state.

State
calculateState(const State &previous, const Input &input, const Config &config) {
  State next = previous;
  next.time = input.time;

  for (unsigned int i=0; i<N_PADS; i++) {
    next.pads[i] = calculateStatePad(previous.pads[i], input.values[i], config.pads[i], config);
  }
  calculateMidiMessages(next, config, next.messages);
  return next;
}

Because all the inputs and outputs of the functions are plain-old-data, we can safely and meaningfully serialize and deserialize them.
To get better visibility into the internals of the system and help our understanding, we also store intermediate values:

PadState
calculateStatePad(const PadState &previous, const PadInput input,
                  const PadConfig &config, const Config &appConfig) {
...
  next.raw = input.capacitance;
  next.highfilter = exponentialMovingAverage(input.capacitance, previous.highfilter, appConfig.highpass);
  next.highpassed = input.capacitance - next.highfilter;
  next.lowpassed = exponentialMovingAverage(next.highpassed, previous.lowpassed, appConfig.lowpass);
  next.value = next.lowpassed;
...
}

If we had used a dataflow system like MicroFlo, we would have gotten this introspectability for free.

Combining the recorded input data logs with this platform-independent application logic, we can now make a simulator for our firmware:

    std::vector<hangdrum::Input> inputEvents = read_events(filename);
    std::vector<hangdrum::State> states;

    hangdrum::State state;
    for (auto &event : inputEvents) {
        state = hangdrum::calculateState(state, event, config);
        states.push_back(state);
    }
    write_flowtrace(trace_filename, trace);

To store the execution I used a flowtrace, a JSON-based format for tracing Flow-based-programming/dataflow systems.

Because time is just data in our programming model (part of Input or State), we can run through hours of input scenarios in seconds.
I made another plotting tool, this time reading the flowtrace, visualizing all the steps in our signal processing pipeline, and the detected notes.

Avoiding false triggering for hovering hand, by looking at changes instead of absolute values.

By going over a range of different input scenarios and seeing how different values perform, we get a decent confidence that the algorithm works. But does it actually run fast enough on the Arduino?

Profiling on device

The Atmel AVR chip on the Arduino Leonardo is an 8-bit processor without a floating point unit. So I was a bit worried about the exponential averaging filter using several expensive features: 16-bit `int`, divisions and a multiplication with a float. Using an Arduino sketch to do some simple profiling showed that my worries were unfounded.

const long beforeCalculation = millis();
State next = state;
for (int i=0; i<100; i++) {
  next = hangdrum::calculateState(state, input, config);
}
state = next;
const long afterCalculation = millis();
Serial.print("calculating: ");
Serial.println(afterCalculation-beforeCalculation);

The 100 iterations of the application logic took 80 ms with both a high-pass and a low-pass filter, or less than 1 ms per execution. Since sensor readout is up to 10 ms, it dominates the time spent. So if we want lower latency, optimization efforts should be focused on sensor readout first. Only when sensor readout is down to around 1 ms does it make sense to optimize the filtering.

Don’t forget the hardware

Testing the highpass-based code in practice showed that yes, it did correctly detect tapping while suppressing false triggers from a palm hovering over the sensor. Another benefit of using change detection is that a note will trigger even if one finger is already touching the pad and another finger hits it. With absolute value thresholding, the second finger tap is not detected.

However, we also found that by moving the sensor to the outside, the data quality increases a lot. In this case, even the simple absolute threshold code was able to correctly discriminate a hovering palm. The higher data quality may also enable other features like velocity or aftertouch.

Putting sensor on outside of body gives better readings, but requires additional manufacturing steps.

 

In conclusion

Sensor recording, separating hardware from application logic, and host-based simulation together form powerful tools that help you better understand an embedded system. By visualizing the input data, the internal state and the outputs of the firmware, you can more easily see and understand the conditions which cause problems. The effects of changing the firmware can be quickly understood, as re-running the simulation suite on a wide range of inputs can be done in seconds. It can be implemented easily in C++ firmware, and any scripting language can be used for the data analysis.

The techniques here form a baseline level of tooling which can be extended in many ways.
If we had recorded a high-framerate video stream together with the input data, it would be easier to see how the input data corresponds to physical actions.

We could annotate the input data to indicate the correct locations of notes (expected output of system), and write an automated test against the output trace to check how well the firmware detects them. By using data-driven testing, we could generate test variations over different inputs and filter configuration. And then use machine learning techniques to help find the best values for the filters.

We could also create an end-to-end test covering the vast majority of the code by injecting input sensor data into the on-device firmware over serial, and then verifying that it produces the expected MIDI messages.

Live programming IoT systems with MsgFlo+Flowhub

Last weekend at FOSDEM I presented in the Internet of Things (IoT) devroom,
showing how one can use MsgFlo with Flowhub to visually live-program devices that talk MQTT.

If the video does not work, try the alternatives here. See also full presentation notes, incl example code.

Background

Since announcing MsgFlo in 2015, it has mostly been used to build scalable backend systems (“cloud”), using AMQP and RabbitMQ. At The Grid we’ve been processing hundred thousands of jobs each week, so that usecase is pretty well tested by now.

However, MsgFlo was designed from the beginning to support multiple messaging systems (including MQTT), as well as other kinds of distributed systems – like a networks of embedded devices working together (one aspect of “IoT”). And in MsgFlo 0.10 this is starting to work pretty nicely.

Visual system architecture

Typical MQTT devices have the topic names hidden in code. Any documentation is typically kept in sync (or not…) by hand.
MsgFlo lets you represent your devices and services as FBP/dataflow “components”, and a system as a connected graph of component instances. Each device periodically sends a discovery message to the broker. This message describes the role name, as well as which ports exist (including their MQTT topic names). This leads to a system architecture which can be visualized easily:

Imaginary solution to a typically Norwegian problem: Avoiding your waterpipes freezing in the winter.

Rewire live system

In most MQTT devices, output is sent directly to the input of another device, by using the same MQTT topic name. This hardcodes the system functionality, reducing encapsulation and reusability.
With MsgFlo, each device *should* send output and receive input on topics namespaced to the device.
Connections between devices are handled on the layer above, by the broker/router binding different topics together. With Flowhub, one can change these connections while the system is running.

Change program values on the fly

Changing a parameter or configuration of an embedded device usually requires changing the code and flashing it. This means recompiling and usually being connected to the device over USB. This makes the iteration cycle pretty slow and tedious.
In MsgFlo, devices can (and should!) expose their parameters on MQTT and declare them as inports.
Then they can be changed in Flowhub, the device instantly reflecting the new values.

Great for exploratory coding; quickly trying out different values to find the right one.
Examples are when tweaking animations or game mechanics, as it is near impossible to know up front what feels right.

Add component as adapters

MsgFlo encourages devices to be fairly stupid, focused on a single generally-useful task like providing sensor data, or a way to cause actions in the real world. This lets us define “applications” without touching the individual devices, and adapt the behavior of the system over time.

Imagine we have a device which periodically sends the current temperature, as a floating-point number in degrees Celsius. And a display device which can display text (for instance a small OLED). To show the current temperature, we could wire them directly:

Our display would show something like “22.3333333”. Not very friendly, how does one know what this number means?

Better add a component to do some formatting.

Adding a Python component

Component formatting incoming temperature number to a friendly text string

And then insert it before the display. This will create a new process, and route the data through it.

Our display would now show “Temperature: 22.3 C”

Over time maybe the system grows further

Added another sensor, output now like “Inside 22.2 C Outside: -5.5 C”.

Getting started with MsgFlo

If you have existing “things” that support MQTT, you can start using MsgFlo by either:
1) Modifying the code to also send the discovery message.
2) Use the msgflo-foreign-participant tool to provide discovery without code changes

If you have new things, using one of the MsgFlo libraries is a quick way to support MQTT and MsgFlo. Right now there are libraries for Python, C++11, Node.js, NoFlo and Arduino.

Gitorious Merge Request Monitor

In Maliit, all changes have to be reviewed by two people in order to be merged to mainline. This helps us catch issues early and keep code quality high. Since the code is hosted on Gitorious, we use their merge requests feature for that purpose. Up until now we have periodically checked the website for changes (potentially going through each and every one of the repositories), and manually mentioned updates in the IRC channel. This is both tedious and inefficient, so I wrote a simple tool to help the issue: Gitorious Merge Request Monitor

It provides an IRC Bot which gives status updates on merge requests in an IRC channel:

16:01 < mrqbot-AfFa1> desertconsulting requested merge of ~desertconsulting/maliit/desertconsultings-maliit-framework with maliit-framework in http://gitorious.org/maliit/maliit-framework/merge_requests/126
16:01 < mrqbot-AfFa1> mikhas updated http://gitorious.org/maliit/maliit-framework/merge_requests/125  State changed from Go ahead and merge to Merged

One can also query the current status from it:

15:02 < jonnor> mrqbot-7ACeB: list
15:02 < mrqbot-7ACeB> maliit-framework/127: - New - Allow QML plugins to add custom import paths for QML files and QML modules
15:02 < mrqbot-7ACeB> maliit-framework/126: - Need info - configurable importPath for qml
15:02 < mrqbot-7ACeB> maliit-plugins/26: - New - Add PluginClose from main view and add key repetition support
15:02 < mrqbot-7ACeB> maliit-plugins/25: - New - Clear active keys and magnifier on keyboard change
15:02 < mrqbot-7ACeB> maliit-plugins/24: - New - Remove QtGui dependency from libmaliit-keyboard
15:02 < mrqbot-7ACeB> maliit-plugins/23: - New - Get rid of Qt keywords
15:02 < mrqbot-7ACeB> maliit-plugins/22: - New - Add phone number and number layout getters.

Status changes are retrieved by periodically checking the Gitorious project activity feed (Atom)*, and the status itself is scraped from the website. There is no other API right now, unfortunately. Implemented in Python with Twisted, feedparser and BeautifulSoup doing all of the heavy lifting.

Get it from PyPi, using easy_install or pip:
pip install gitorious-mrq-monitor
gitorious-mrq-monitor --help # For usage information

For now this solves the immediate need for the development work-flow we have in the Maliit project. Several ideas for extending the tool are mentioned in the TODO. Contributions welcomed!

The Maliit buildbot

After having set up the typical things open source projects needs like a website/wiki, mailing-list and bug-tracker, Maliit now also has something not so common: a build-bot.

As Maliit consists of several components that can be built in several different ways (and for several different platforms), we wanted to automate the build and tests of the different variations to ensure that we do not break any of them. This is especially important for variations which are not easy to test for the individual developers, like for instance Maliit on Qt 5.

The software chosen to help with this task was Buildbot. Getting an initial instance up and running was very quick and pain-free, especially thanks to packages being easily available and the excellent documentation. The current setup now builds, tests and installs the two major components we have: Maliit Framework and Maliit (Reference) Plugins, in the most important build/config variations we have. A total of 12 individual build jobs, plus 2 meta-builds. The configuration for the instance can be found in the maliit-buildbot-configuration repository.

For security reasons the build-bot is not directly exposed to the Internet. Instead a script runs every 5 minutes to generate a static HTML website and publish on the public web-server: Maliit build-bot

Buildbot says: All green!

This gives us a minimal continuous integration system for Maliit, which for now will hopefully help us avoid breakage. In the future, the usage of the build-bot might extend to include:

  • Automating the release process
  • Testing of merge-requests/patches before merging to master
  • Automated integration/system testing, complementing the unit-tests
  • Triggering external builders for packaging. OpenSUSE OBS, Maemo 5 Garage, etc.
  • Automating certain aspects of bug-lifetime. Resolving when fix is committed, closing on release if pre-verified, etc

 

Combining PySide and PyGObject introspection bindings

Some while back I added basic GObject Introspection support to GEGL and GEGL-GTK master. This will* allow application developers to write their Gegl + Gtk based applications in any language supported by GObject Introspection, like Python, Vala and JavaScript. For GeglQt, the Qt integration library for using Gegl in Qt based applications, it was natural to use PySide to provide Python bindings. The initial setup was quick and easy, thanks to the binding tutorial, but there was one challenge.

The current widgets provided by GeglQt are for displaying the output of a node in the GEGL graph. Therefore they have methods with the following signature to hook it up:

From gegl-qt/nodeviewwidget.h
GeglNode *inputNode() const;
void setInputNode(GeglNode *node);

GeglNode is a GObject (from the C based glib) subclass, and without help the bindings generator (Shiboken) does not know what to do with it so the method cannot be bound. PySide could have been used to also generate bindings for Gegl itself, but what we actually want to do is to make use of the existing PyGObject based bindings.

Marcelo Lira on #pyside let me know that this should be possible by adding some annotations to the typesystem.xml file, and implementing a Shiboken::Converter<T>. It is indeed possible, and for the above type looks something like this:

From typesystem_gegl-qt.xml
<primitive-type name="GeglNodePtr">
      <conversion-rule file="geglnode_conversions.h"/>
      <include file-name="pygobject.h" location="global"/>
</primitive-type>

From geglnode_conversions.h
namespace Shiboken {
template<>
struct Converter<GeglNodePtr>
{
    static inline bool checkType(PyObject* pyObj)
    {
        return GEGL_IS_NODE(((PyGObject *)pyObj)->obj);
    }

    static inline bool isConvertible(PyObject* pyObj)
    {
        return GEGL_IS_NODE(((PyGObject *)pyObj)->obj);
    }

    static inline PyObject* toPython(void* cppObj)
    {
        return pygobject_new(G_OBJECT((cppObj)));
    }

    static inline PyObject* toPython(const GeglNodePtr geglNode)
    {
        return pygobject_new(G_OBJECT(geglNode));
    }

    static inline GeglNodePtr toCpp(PyObject* pyObj)
    {
        return GEGL_NODE(((PyGObject *)pyObj)->obj);
    }
};
}

The PyGObject C API and the GObject type system are here being used to implement what Shiboken needs. The attentive reader will note that GeglNodePtr is used and not GeglNode*. This is a simple “typedef GeglNode * GeglNodePtr”, which looks to be necessary with current PySide (1.0.6) to avoid it being confused by the pointer. Hopefully that is fixable and won’t be necessary in the future.

With this solved, I committed the initial Python support to GeglQt master yesterday. It contains a trivial Python example showing the usage. Some build cleanups, binding generator tweaks and testing remain to be done, but expect Python support to be a prominent feature for GeglQt 0.1.0.

 

* There are still a lot of GObject Introspection annotations missing in Gegl. See the tracking bug. Help wanted!