audio command recognition with machine learning
It took a while, but I think I am slowly getting a practical understanding of AI and machine learning. In machine learning, you basically use lots of computation and repetition to create a function of such complexity that you are happy the machine came up with the solution, rather than spending hours and hours, or even days and weeks, on it yourself. Very often, the problems are some kind of pattern recognition: you know a particular problem follows a pattern, but the circumstances are so diverse that you believe machine learning techniques can produce a result that is good enough.
When learning about machine learning, you get confronted with many examples that solve a problem: recognizing words or objects in images, text in audio, or oversimplified problems where it would be easier to think about the problem and come up with an algorithm yourself. The hope is always that the algorithms applied in ML produce a solution cheaper, and potentially better, than a person could.
The past weeks I was discussing some problems with my colleagues, and I thought: you have a decent data source available, you could try some machine learning. It was about investing in cryptocurrencies, or giving the computer commands by analyzing audio.
What I find is that machine learning never works entirely on its own. It is very effective to implement many preprocessing steps for filtering and normalizing the data, so that a neural network can work more effectively, with fewer resources, on the difficult parts of the problem.
Through some thought experiments and a bit of research, I came up with an approach that could be applied to recognize commands in audio. The algorithm would look a little like the following:
audio command recognition
First, you need the audio data, recorded or live. For training we need recorded samples, lots of them; the good part is that recording is fast. Second, normalization. A strategy I would apply first is cutting on silence: each part becomes a potential word command. As we don't know how fast someone is speaking, I treat all parts as if they had the same length.
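To make the cutting step concrete, here is a minimal sketch. The function name, the amplitude threshold, and the minimum gap length are my own hypothetical choices, assuming the recording is available as a mono Float32Array of PCM samples (for example from the Web Audio API):

```js
// A sketch of silence-based segmentation. Threshold and gap length
// are made-up values you would tune for your own recordings.
function splitOnSilence(samples, threshold = 0.02, minSilentSamples = 4000) {
  const parts = [];
  let start = null;
  let silentRun = 0;
  for (let i = 0; i < samples.length; i++) {
    if (Math.abs(samples[i]) > threshold) {
      if (start === null) start = i; // a new loud part begins
      silentRun = 0;
    } else if (start !== null) {
      silentRun++;
      if (silentRun >= minSilentSamples) {
        // enough silence: close the current part
        parts.push(samples.slice(start, i - silentRun + 1));
        start = null;
        silentRun = 0;
      }
    }
  }
  if (start !== null) parts.push(samples.slice(start));
  return parts; // each entry is one potential word command
}
```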
Each part we can now process with known algorithms already used in audio editing and similar tools. It is very popular to generate a spectrogram: a bitmap where one axis is time and the other is the detected audio frequencies, sorted from low to high. Depending on the accuracy we need, this would be a good moment to reduce the resolution. Let's say we cut the time of one slice of audio into 100 steps and the frequencies into 10 to 20 bands. To achieve good results, I think you can simply play with these parameters. Each pixel in the audio slice is going to be one input for the feed-forward network.
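Here is a sketch of that resolution reduction, assuming `spectrogram` is an array of time frames, each holding one magnitude per frequency bin (the FFT step that produces it is not shown; you would use an audio or FFT library for that). The slot counts mirror the 100 and 10-to-20 values from above:

```js
// Reduce a spectrogram to a fixed-size grid by averaging.
function downsample(spectrogram, timeSlots = 100, freqSlots = 16) {
  const frames = spectrogram.length;
  const bins = spectrogram[0].length;
  const grid = [];
  for (let t = 0; t < timeSlots; t++) {
    for (let f = 0; f < freqSlots; f++) {
      // average all raw cells that fall into this grid cell
      const t0 = Math.floor((t * frames) / timeSlots);
      const t1 = Math.max(t0 + 1, Math.floor(((t + 1) * frames) / timeSlots));
      const f0 = Math.floor((f * bins) / freqSlots);
      const f1 = Math.max(f0 + 1, Math.floor(((f + 1) * bins) / freqSlots));
      let sum = 0, count = 0;
      for (let i = t0; i < t1; i++) {
        for (let j = f0; j < f1; j++) { sum += spectrogram[i][j]; count++; }
      }
      grid.push(sum / count);
    }
  }
  return grid; // flat array: one number ("pixel") per network input
}
```

A nice side effect: averaging into a fixed grid also takes care of the equal-length treatment from step two, because long and short parts end up with the same number of inputs.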
Here we are at step three. So far it was just data processing; now we need the actual ML library. As I try to do most things in JS, I want to suggest Brain.js first. It is a pure JS implementation that runs in the browser and in Node. In my first experiments I was satisfied with its performance, as it was very fast to try different settings, and it is very important, as a machine learning engineer, to learn the effect of all the different settings. To check whether the training worked, you can run some live tests or use prepared recordings that were not included in the training data.
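A minimal Brain.js sketch of this step, assuming `samples` holds the labeled grids from the preprocessing above; the layer size and the training options are exactly the kind of settings you get to experiment with:

```js
const brain = require('brain.js');

// one hidden layer to start with; the size is an assumption to tweak
const net = new brain.NeuralNetwork({ hiddenLayers: [64] });

// input: the flattened spectrogram "pixels", output: one key per command
const trainingData = samples.map(({ grid, label }) => ({
  input: grid,
  output: { [label]: 1 },
}));

net.train(trainingData, {
  iterations: 2000,
  errorThresh: 0.005,
  log: true,
});

// later, on a part that was NOT in the training data
// (spectrogramOf is a hypothetical helper wrapping the FFT step)
const result = net.run(downsample(spectrogramOf(newPart)));
console.log(result); // e.g. { play: 0.91, stop: 0.04, next: 0.02 }
```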
More popular and also more complex than Brain.js is the TensorFlow.js library. So far I have personally found its API too complex to be used effectively for actual application development, but with the additional ml5 module, many use cases are automated and provided with a reasonable API. By running all the machine learning algorithms on the user's GPU, TensorFlow.js can greatly outperform Brain.js.
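For comparison, here is a sketch of an equivalent network in TensorFlow.js. The shapes assume the 100 × 16 grid from above and three hypothetical commands; the layer sizes are again placeholders:

```js
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
model.add(tf.layers.dense({ inputShape: [1600], units: 64, activation: 'relu' }));
model.add(tf.layers.dense({ units: 3, activation: 'softmax' })); // one unit per command
model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });

// xs: one row of 1600 pixels per sample, ys: one-hot encoded labels
const xs = tf.tensor2d(samples.map((s) => s.grid));
const ys = tf.tensor2d(samples.map((s) => s.oneHotLabel));
await model.fit(xs, ys, { epochs: 50 }); // inside an async context / ES module
```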
So this is basically it: how you could go and build a voice command recognition tool with ML and JS.
deep learning
At the end of this post I want to leave you with one more special thought: you can now take these techniques further and even use the next generation of ML: deep learning.
For the term deep learning I actually found multiple definitions. Personally I will adopt the second, but first things first. The first: deep learning is when you use feed-forward networks with multiple hidden layers, meaning that a single network is deep. I think this definition is a little flat and the technique a little too simple.
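In Brain.js terms, and assuming the same constructor as in the sketch above, this first definition just means passing more than one hidden layer:

```js
// "deep" in the first sense: more than one hidden layer (sizes are arbitrary)
const deepNet = new brain.NeuralNetwork({ hiddenLayers: [128, 64, 32] });
```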
The second is better: it is generating a network for generating networks, and finding out which network is the best. Wow, it seems like software engineers compose absolutely everything: functions, UI components, blockchains with side-chains and, of course, stones.
But what does this definition really mean? Here it is: go back in this post to where I wrote about experimenting with arguments, settings, and parameters. To do deep learning in this sense, you have to build the whole pipeline into a parameterized function or script. Another script can then try different combinations of arguments for you, automagically: you just let the computer try many different settings, as in the sketch below. It might be busy for quite a while, but you save the time of tweaking the configurations yourself. (This approach is usually called hyperparameter optimization, or neural architecture search when the network structure itself is what varies.)
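Here is a sketch of such a script. `trainAndScore` is an assumed wrapper around the whole pipeline above that returns the accuracy on the held-out recordings; the candidate values are made up:

```js
// Try every combination of a few hand-picked candidate settings
// and keep the best-scoring one.
async function gridSearch() {
  let best = { score: -Infinity };
  for (const freqSlots of [10, 14, 20]) {
    for (const hiddenUnits of [32, 64, 128]) {
      for (const iterations of [1000, 2000, 5000]) {
        const score = await trainAndScore({ freqSlots, hiddenUnits, iterations });
        if (score > best.score) {
          best = { score, freqSlots, hiddenUnits, iterations };
        }
      }
    }
  }
  return best; // the settings the machine found for you
}
```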
What do you think? Will you try some ML?