Using Teachable Machine for Sound and Image Classification in AI Applications


Although it sounds relatively futuristic and modern, Artificial Intelligence (AI) is not a new concept by any means. In fact, the earliest substantial theoretical work in the field was done in the mid-20th century by Alan Turing, who is widely considered to be the father of artificial intelligence. And what once seemed limited to the world of science fiction is now very much our reality.

Today, thanks to enormous technological advancements, we have self-driving cars, computers that improve breast cancer detection, and even computers that may one day replace human programmers (spoiler alert: they probably won’t). Even with these incredible inventions, countless ethical and moral questions related to AI remain unanswered, and today, unfortunately, we are not going to answer them either.

However, regardless of our opinion on it, AI looks like it is here to stay, and, surprisingly to many, it has already made its way into all facets of our daily lives. That is why we are going to talk about its use and how we can leverage it to our advantage. Specifically, we are going to talk about classification problems, introduce you to Teachable Machine, and go into a bit more detail about sound and image classification. Let’s get started.

Classification problems

There are essentially two types of classification problems:

  1. Binary classification — classifies data into two classes (good/bad, yes/no, etc.).
  2. Multiclass classification — classifies data into three or more classes.
Binary vs. multiclass classification
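
To make the distinction concrete, here is a minimal JavaScript sketch of what the two kinds of classifier output might look like (the class names and scores are made up for illustration):

```javascript
// Hypothetical classifier outputs: maps from class name to probability.
const binaryResult = { spam: 0.92, notSpam: 0.08 };           // two classes
const multiclassResult = { cat: 0.75, dog: 0.2, bird: 0.05 }; // three or more classes

// Picking the predicted class works the same way in both cases:
function predictedClass(probabilities) {
  return Object.entries(probabilities)
    .reduce((best, entry) => (entry[1] > best[1] ? entry : best))[0];
}

console.log(predictedClass(binaryResult));     // "spam"
console.log(predictedClass(multiclassResult)); // "cat"
```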

Such problems arise in an unexpectedly large number of situations, such as:

  • Customer behavior classification — determining the type of customer, based on the patterns in their interactions with our products.
  • Spam filtering — using algorithms to detect a spam email and filter it out from the actual useful mail we get on a daily basis.
  • Image classification — determining what type of object or scene is in a digital image.
  • Sound classification — analyzing audio patterns (in practice, the spectrograms of the sounds) to classify different types of sounds.

One tool that has proven extremely useful in dealing with these types of classification problems is Teachable Machine, developed by Google. Let’s take a closer look at what Teachable Machine is and how it works.

Teachable Machine, nice to meet you!

What is this Teachable Machine? Glad you asked. Teachable Machine is a web-based tool that helps us quickly and easily create AI models that can then be used in all kinds of projects. It is designed in such a way that people of all skill levels can use its intuitive interface and make their own machine learning models “from scratch”. I put “from scratch” in quotes because we are actually not starting from scratch when we create these models.

Teachable Machine is meant for anyone who wants to use AI in their field of work, whether that be educators, artists, or students, without actually needing any prerequisite knowledge in machine learning. This is one of the main reasons why Teachable Machine is so versatile and accessible and has already been used in classrooms and prototype projects all over the world.

Under the hood, Teachable Machine is built on top of TensorFlow.js, a machine learning library for JavaScript. You can check out their GitHub repo for references and ideas.

The models that Teachable Machine produces are trained using a technique called transfer learning, a machine learning (ML) method where a model trained on one task is repurposed for another, related task. Basically, a neural network is trained on a large dataset and retains that knowledge for future use. We can then add our own data to train the model to our liking, and this requires only a small change to the inner workings of the already trained model. Doing so takes significantly less computing power and only a small dataset, and the resulting product is our own custom model, ready to go.
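
As a rough sketch of what this looks like in code, here is one way to set up transfer learning with TensorFlow.js. This is a simplified illustration, not Teachable Machine’s actual internals; the base model URL and the choice of penultimate layer are assumptions:

```javascript
// Minimal transfer-learning sketch with TensorFlow.js.
import * as tf from '@tensorflow/tfjs';

async function buildTransferModel(baseModelUrl, numClasses) {
  // 1. Load a pre-trained network and freeze its learned weights.
  const base = await tf.loadLayersModel(baseModelUrl);
  base.layers.forEach((layer) => (layer.trainable = false));

  // 2. Reuse everything up to the penultimate layer as a feature extractor.
  const featureLayer = base.layers[base.layers.length - 2];
  const featureExtractor = tf.model({ inputs: base.inputs, outputs: featureLayer.output });

  // 3. Attach a small, trainable classification head for our own classes.
  const head = tf.sequential({
    layers: [
      tf.layers.flatten({ inputShape: featureLayer.outputShape.slice(1) }),
      tf.layers.dense({ units: numClasses, activation: 'softmax' }),
    ],
  });
  head.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy' });

  // Only the head's weights change when we train on our small dataset,
  // e.g. head.fit(featureExtractor.predict(images), labels).
  return { featureExtractor, head };
}
```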

Transfer learning

So, depending on what type of model we want to create, Teachable Machine gives us a model that has already been pre-trained on some data (whether that is audio or image data). We then just add our own sample data, train the model, and we are good to go!

Instructions on using Teachable Machine

The entire process of gathering data and training the model happens inside the browser, which has distinct advantages over other ways of training models, one of the most significant being data privacy.

Since all the data is created locally (via a webcam or a microphone) and the model itself can be downloaded locally, we as users can rest assured that our data won’t leave our computer if we don’t want it to.

Okay, so far so good. We can easily create these ML models, but the question now is: What kinds of models can we make and what can we do with them?

Well, as of April 2022, Google’s Teachable Machine supports three types of models, with more coming soon:

  • The image classifier — classifies images
  • The sound classifier — classifies sounds
  • The pose classifier — classifies human poses

In the next few sections, we will dive a bit deeper into the two main classifiers — the image and sound classifiers. We will look at a high-level overview of each model, get a sneak peek under the hood and then explain how they can be used in our own projects. Let’s begin!

The image classifier

Image classification can be applied to many different situations and problems in the world of computer science, as well as in everyday life. Teachable Machine’s image classifier makes it really easy to create models that can classify just about any type of object or scene given as a digital image. What is interesting about this model is how it leverages transfer learning to make the process of creating our own model fast and easy.

The image classifier that Teachable Machine gives us is actually a pre-trained model called MobileNet, trained on as many as one thousand different image classes. MobileNet is, in essence, a convolutional neural network (CNN), and it has been trained on the ImageNet database, a large-scale hierarchical image database.

What this means is that the MobileNet model has already learned to boil the essence of images down into numbers, making accurate predictions about the objects portrayed in the images we feed it. We can therefore take this trained neural network, provide it with significantly less data (our own sample data), and train it much faster than initially expected, getting astoundingly accurate predictions. Awesome!
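
As a quick illustration of that last point, MobileNet can be used on its own from JavaScript via the @tensorflow-models/mobilenet package. A minimal sketch, assuming imgElement is an image element already on the page:

```javascript
// Sketch: classifying an image with the pre-trained MobileNet model.
import * as mobilenet from '@tensorflow-models/mobilenet';

async function classifyImage(imgElement) {
  const model = await mobilenet.load();                  // downloads pre-trained weights
  const predictions = await model.classify(imgElement);  // top ImageNet classes
  // predictions looks like: [{ className: 'banana', probability: 0.93 }, ...]
  return predictions;
}
```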

The sound classifier

Similar to the image classifier, the sound classifier is extremely easy and intuitive to use. All we as users have to do is record some sample data on a microphone (either an external one or the one built into our device) and train the model in-browser, and we have ourselves a working sound classifier!

It is important to note that, behind the scenes, the sound classifier also receives images as its training input, but these images are quite different from those the classic image classifier uses.

What the sound classifier receives as input is the spectrogram of the sound we record: a 2D time-frequency representation of the sound, obtained through a (short-time) Fourier transform.
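
For the curious, TensorFlow.js can compute such a spectrogram directly. A minimal sketch, where the random signal stands in for one second of 16 kHz audio and the frame sizes are arbitrary choices:

```javascript
// Sketch: turning a 1-D audio signal into a spectrogram via a
// short-time Fourier transform (STFT) in TensorFlow.js.
import * as tf from '@tensorflow/tfjs';

const signal = tf.randomNormal([16000]);        // stand-in for 1 s of 16 kHz audio
const stft = tf.signal.stft(signal, 256, 128);  // frames of 256 samples, hop of 128
const spectrogram = tf.abs(stft);               // magnitude: a time x frequency "image"
console.log(spectrogram.shape);                 // [number of frames, frequency bins]
```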

The actual convolutional neural network that has been pre-trained looks like this:

Timeline chart of the sound classification model

In his article, TensorFlow Developer Advocate Khanh Leviet explains this process succinctly:

The model receives a spectrogram. It first processes the spectrogram with successive layers of 2D convolution (Conv2D) and max pooling layers. The model ends in a number of dense (fully-connected) layers, which are interleaved with dropout layers for the purpose of reducing overfitting during training. The final output of the model is an array of probability scores, one for each class of sound the model is trained to recognize.
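
Translated into TensorFlow.js layers, that description corresponds to something like the following sketch. The filter counts, dropout rates, and the 43x232 spectrogram input shape are illustrative assumptions, not the model’s actual configuration:

```javascript
// Rough sketch of a spectrogram-classifying CNN like the one described above.
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
// Successive 2D convolution and max pooling layers over the spectrogram.
model.add(tf.layers.conv2d({ inputShape: [43, 232, 1], filters: 8, kernelSize: 3, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
model.add(tf.layers.conv2d({ filters: 16, kernelSize: 3, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));
// Dense layers interleaved with dropout to reduce overfitting.
model.add(tf.layers.flatten());
model.add(tf.layers.dropout({ rate: 0.25 }));
model.add(tf.layers.dense({ units: 128, activation: 'relu' }));
model.add(tf.layers.dropout({ rate: 0.5 }));
// One probability score per sound class.
model.add(tf.layers.dense({ units: 20, activation: 'softmax' }));
```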

Both the sound and image classifiers use feature extraction for their models to learn, but as we can see, they do it in their own distinctive ways. Teachable Machine’s sound classifier has been trained on the TensorFlow Speech Commands Dataset, with a somewhat limited vocabulary of 20 words: the digits 0 through 9, along with the words "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", and "Go". A few more words were added to test the model’s ability, but these 20 remain the core of the model.
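
In practice, a trained Teachable Machine sound model runs in the browser through the speech-commands library it is built on. A rough sketch, where the model URL is a placeholder for your own exported model:

```javascript
// Sketch: listening with a Teachable Machine sound model in the browser.
import * as speechCommands from '@tensorflow-models/speech-commands';

async function listenForSounds(modelBaseUrl) {
  const recognizer = speechCommands.create(
    'BROWSER_FFT',                    // use the browser's native FFT
    undefined,                        // no base vocabulary override
    modelBaseUrl + 'model.json',      // our trained model
    modelBaseUrl + 'metadata.json'    // our class labels
  );
  await recognizer.ensureModelLoaded();

  const labels = recognizer.wordLabels();
  recognizer.listen(async (result) => {
    // result.scores lines up one-to-one with the class labels.
    const scores = Array.from(result.scores);
    const best = scores.indexOf(Math.max(...scores));
    console.log(labels[best], scores[best].toFixed(2));
  }, { probabilityThreshold: 0.75 });
}
```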

Okay, so we have learned a bit about how Teachable Machine works and what it can do for us — but how can we use Teachable Machine to actually create useful and helpful projects in the real world? This is what we will discuss next.

Teachable Machine and the real world

When figuring out how we can apply these awesome models that Teachable Machine offers us, we must first ask ourselves: what can we use to build our own projects? Namely, which platforms (programming languages, frameworks, technologies) can integrate Teachable Machine models?

Firstly, all of the models can be integrated into any website, web app, or mobile app that runs on JavaScript (using TensorFlow.js). After training a model, we can export it locally as a small set of files (a model.json describing the network, plus its weights and metadata), or use the helpful snippets that are provided for us.
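
For instance, the generated snippet for an image model looks roughly like this (the model URL is a placeholder for your own model, and the tmImage global is assumed to come from the @teachablemachine/image script tag):

```javascript
// Sketch: loading and using an exported Teachable Machine image model.
const URL = 'https://teachablemachine.withgoogle.com/models/<your-model-id>/';

let model;

async function init() {
  // model.json describes the network; metadata.json holds the class labels.
  model = await tmImage.load(URL + 'model.json', URL + 'metadata.json');
}

async function predict(imageElement) {
  const predictions = await model.predict(imageElement);
  // e.g. [{ className: 'ripe banana', probability: 0.97 }, ...]
  return predictions;
}
```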

Furthermore, in addition to JavaScript support, Teachable Machine provides support for Python projects (using TensorFlow), as well as Android applications (using TensorFlow Lite).

But wait, there’s even more! Using TensorFlow Lite, not only can we integrate Teachable Machine models into our Android applications, we can also use them in various IoT projects, with the image classifier offering an export optimized specifically for microcontrollers such as the Arduino.

So, it looks like we have an abundance of options to go with when considering our next groundbreaking project. The only limit to what we can do with Teachable Machine classifiers is our imagination.

Are you curious to see some more examples? Here’s a list of already built applications to spark your inspiration: The Awesome Teachable Machine List. There you can find starter projects such as the Bananameter, which uses the image classifier to tell whether a banana is ripe. You can also find a Snake Game that needs no buttons, making video games accessible to people with disabilities, one old-school Atari game at a time. On top of that, you will find plenty of resources, tutorials, and libraries to help you out in your quest for the next game-changing AI application.

Do you have an application with audio and video capabilities? Want to make sure the quality is up to standard? We can help. Get in touch and let’s discuss your project.
