Messing with neural networks.
It’s both easier and harder than I expected. Recommended if you’re as lost as me in this world.
What do I mean by that? Using an API like Keras and a framework like TensorFlow really lowers the amount of thinking you would need if you were to build your AI model neuron by neuron. Most of the terms you’d hear are already implemented in Keras and TensorFlow, so there is no need to do the math yourself (although I still think it is important to do some research, so you know what’s going on inside those pretty classes like Sequential(), Conv1D(), LSTM(), etc. More on that later).
—And what’s the hard part?
Well, you’d expect the model to be in charge of most of the work. And, if you consider “work” to be your goal, yes, it is! But for someone like me, who doesn’t have a lot of experience in this, there is more to it than getting your data ready and knowing what you want (because sometimes you don’t totally know what you want). Take a seat, this will take a while.
Preprocessing data
As mentioned before, the specific inputs for the model are the features extracted from the audio files and the phonetic labels for the audio, in a different format than the .lab file you might use for training an NNSVS voice.
First of all, the data has to be prepared as vectors, matrices, or tensors. If you don’t know what those are, they are mathematical structures that contain data, where each dimension represents a property, characteristic, feature, or time position…
Think of a vector like a list. A vector has only one dimension: the position of each item (1st, 2nd, 3rd; in computing, most structures start at position “0”, though). This is the case for the labels. We can think of the vector as a list of the audio’s phonemes over time. At time 0, the phoneme is Silence. At time 2, the phoneme is still Silence. At time 4, the phoneme is [a], and so on.
Matrices are two-dimensional structures instead, just like tables. In our case, the input data is two-dimensional (Hm…). Take another look at the picture above. Row 0 has multiple “possibilities”: you can look at the f0 column or at the amplitude one. Therefore, we have two dimensions, time and feature.
A tensor takes it one step further. The term itself has a way more complicated definition, since it has uses in different disciplines like calculus, physics, and algebra. But in this case, we can think of it as a structure with three or more dimensions. The tensor we would use holds the two-dimensional features.
— But aren’t you just saying they are 2 dimensional features?
Yes, but it works just like the 1-dimensional (1D from now on) features, which turn into a 2-dimensional (2D) object when joined together. 2D features like MFCCs, their delta (the rate of change over time, which involves differences and calculus) and their acceleration (the delta of the delta) each have two dimensions, time and MFCC index (an “MFCC” here is actually a collection of coefficients per time frame), so when they are stacked together they turn into a 3D object or, more accurately, a 3D tensor.
All of these will get another dimension: sample. Why? Because when you feed the model, you actually input a collection of those samples. For example, the input for the 1D features will be a 3D tensor with the following dimensions: Sample, Time frame, Feature. And for the 2D features, the input will be a 4D tensor with dimensions Sample, Time Frame, MFCC index, Feature.
# The following accesses the value of the 2nd feature
# (amplitude_envelope) in the 6th time frame of the 1st
# sample. Remember that indexes start at 0, not 1.
x_train_1d_features[0][5][1] # Three indexes, so it is a 3D tensor!
# Here, the following accesses the value of the 3rd feature
# (MFCCs acceleration, in this case), for the 4th MFCC index in
# the 11th time frame of the 1st sample.
x_train_2d_features[0][10][3][2] # Four indexes, so it is a 4D tensor.
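If it helps to see those shapes being built, here is a minimal sketch with NumPy. The sizes, the all-zero arrays, and the assumption that every sample has the same number of time frames are made up for illustration; only the variable names at the end match the ones used above.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_samples, n_frames, n_mfcc = 4, 100, 13

# 1D features: one value per time frame (e.g. f0, amplitude envelope).
f0 = np.zeros(n_frames)
amplitude_envelope = np.zeros(n_frames)

# Joining them side by side gives a 2D matrix: (time frame, feature).
features_1d = np.stack([f0, amplitude_envelope], axis=-1)

# 2D features: one row of MFCC coefficients per time frame.
mfcc = np.zeros((n_frames, n_mfcc))
delta = np.zeros((n_frames, n_mfcc))
acceleration = np.zeros((n_frames, n_mfcc))

# Stacking them gives a 3D tensor: (time frame, MFCC index, feature).
features_2d = np.stack([mfcc, delta, acceleration], axis=-1)

# A training set is a collection of samples, which adds a leading dimension.
x_train_1d_features = np.stack([features_1d] * n_samples)
x_train_2d_features = np.stack([features_2d] * n_samples)

print(x_train_1d_features.shape)  # (4, 100, 2)
print(x_train_2d_features.shape)  # (4, 100, 13, 3)
```

Real recordings rarely have the same length, so an actual pipeline would need some padding or trimming step before the samples can be stacked like this.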
Finally, the expected results for training (the labels) also need more than just feeding in your label file.
// The model assumes you use an NNSVS format .lab file, where each unit
// equals 100 ns (nanoseconds).
// Format is:
// [start of the phoneme] [end of the phoneme] [phoneme]
// The spaces are necessary.
0 235000 Sil // This is: from millisecond 0 to millisecond 23.5,
// the phoneme is [Sil] (Silence)
235000 350000 a // From millisecond 23.5 to millisecond 35.0, the phoneme
// is [a]
This data is then processed into a 3D tensor with dimensions for Sample, Time frame, and Phoneme.
— What? Phoneme? How is that a dimension?
Yes. If you remember, the model actually outputs a series of probabilities, not a single result. However, the app’s output needs to be a phoneme for each time frame, not just probabilities. Before exporting, we collapse the probabilities to a single phoneme: the one the model has the highest confidence in. Do you see where we are going with this? Not yet? Probably because I’m not that good at foreshadowing.
We have a phonetic set. Internally, it’s just a list, which means each phoneme is assigned an index: [a] is 0, [i] is 1, and so on. We translate the phonemes to their indexes, so the stream of labels becomes a stream of integers. These are called integer-encoded labels, and they could be the actual output, but we take another step.
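The step from a .lab file to integer labels could look roughly like the sketch below. The phonetic set, the 5 ms frame length, and the function name `lab_to_frame_indices` are assumptions made up for this example, not the values the actual model uses.

```python
# Hypothetical phonetic set and frame length; the real project may differ.
PHONEMES = ["a", "i", "u", "o", "e", "Sil"]
FRAME_LENGTH = 50_000  # 5 ms per time frame, in 100 ns units (assumed)


def lab_to_frame_indices(lab_text, phonemes=PHONEMES, frame_length=FRAME_LENGTH):
    """Turn .lab lines into one phoneme index per time frame."""
    indices = []
    for line in lab_text.strip().splitlines():
        start, end, phoneme = line.split()
        # A frame belongs to the phoneme in which the frame starts, so we
        # count the frame starts that fall inside [start, end).
        first_frame = -(-int(start) // frame_length)  # ceiling division
        last_frame = -(-int(end) // frame_length)
        indices.extend([phonemes.index(phoneme)] * (last_frame - first_frame))
    return indices


lab = """
0 235000 Sil
235000 350000 a
"""
print(lab_to_frame_indices(lab))  # [5, 5, 5, 5, 5, 0, 0]
```

With these made-up settings, the 23.5 ms of silence covers five frames and the [a] covers the next two.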
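These integers essentially serve as indices. What does that mean? For each time frame, we can convert the integer into a vector with one slot per phoneme: a 1 at the position of that frame’s phoneme and 0 everywhere else. Now we have the 3D tensor.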
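# Here, this returns the value for the 4th phoneme ([o]) at time
# frame index 10 of the 1st sample: 1 if that frame is labeled [o],
# 0 otherwise. The model's prediction has the same shape and holds
# its confidence instead.
y_train[0][10][3] # Similar to before: three indices, three dimensions.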
This process is known as one-hot label encoding. If the model output a single integer per time frame instead, it could only report the phoneme it has the most confidence in, not the confidence for each phoneme.
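To make the round trip concrete, here is a small NumPy sketch that one-hot encodes a handful of integer labels and then collapses them back to phonemes with argmax. The phonetic set and the label values are invented for the example.

```python
import numpy as np

phonemes = ["a", "i", "u", "o", "e", "Sil"]  # hypothetical phonetic set

# Integer-encoded labels for one sample: one phoneme index per time frame.
integer_labels = np.array([5, 5, 0, 0, 3])

# One-hot encoding: each index picks one row of the identity matrix, so
# every time frame becomes a vector with a single 1 in its phoneme's slot.
one_hot = np.eye(len(phonemes))[integer_labels]  # shape (5, 6)

# Adding the sample dimension gives the 3D tensor
# (sample, time frame, phoneme).
y_train = one_hot[np.newaxis, ...]  # shape (1, 5, 6)

# Going back: argmax over the phoneme axis picks, for every time frame,
# the phoneme with the highest value.
decoded = [phonemes[i] for i in y_train[0].argmax(axis=-1)]

print(y_train.shape)  # (1, 5, 6)
print(decoded)        # ['Sil', 'Sil', 'a', 'a', 'o']
```

The same argmax works on the model’s predicted probabilities, which is how the probabilities get collapsed to a single phoneme per frame. Keras also ships a helper for the encoding step, `keras.utils.to_categorical`, if you prefer not to do it by hand.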
All of this is extra work you don’t think you’ll have to do at first. Well, at least it wasn’t something I expected. But the good part is that it is something you learn once. When improving this model, you already know what your input layers expect and how you have to preprocess your data. And for other models, you get the overall flow of data processing, so you aren’t as lost as the first time.
Layers, neurons, activation functions, hyperparameters... What?
Do you remember how they were defined in the previous article? No? Here is a quick recap, more detailed this time.
The neurons are the fundamental unit of neural networks, just like biological neurons are the fundamental unit for our nervous system. In our neural network model, each neuron performs a calculation based on the input, weight and bias.
I could just type the mathematical equations that a neuron would perform, but that might just confuse you if you’re not totally committed to learning about machine learning (nor am I an expert or proficient enough to teach you), so I would suggest you check 3Blue1Brown’s series on neural networks. All you have to know is that the neuron receives various inputs from the neurons before it, makes a weighted sum and adds a value of its own (the bias). After that, it feeds the value to an activation function and outputs the result to the neurons after it.
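For the curious, that description fits in a few lines of plain Python. This is only a sketch: the input values, weights, and bias are invented, and tanh is just one possible activation function.

```python
import math


def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs, plus the bias, through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# Hypothetical values: three inputs coming from the previous layer.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, -0.1], bias=0.1)
print(output)
```

During training, the weights and the bias are the numbers that get adjusted; the inputs come from the data or from the previous layer.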
Now, most neurons are grouped in layers. The outputs of the neurons in one layer are the inputs of the next layer, and so on. There are different types of layers with different purposes. Right now, I’ll list the ones I have used so far in the model, and I will explain more as I start using them. For convenience, I’ll use Keras’ class names so you can recall them quickly later.
- Dense: The most basic layer. It receives all of the previous layer’s outputs and sends its own output to every neuron in the following layer. You could think of it as the “default” layer when talking about neural networks.
- LSTM: Almost like a Dense layer, but with horizontal connections inside the same layer and small memory cells. So, when you feed the layer sequential data (all the time frames of a sample), it processes the nth time frame, outputs to the next layer, and also passes that output along to the step that processes time frame n+1.
- Conv1D: A different type of layer. It uses a kernel to convolve over the data, in this case along a single dimension. What’s that? It takes a group of (contiguous) data, applies a calculation and outputs a value. It repeats that process across all of the input data and passes the results to the following layer.
- Conv2D: Like before, but instead of 1D data it is applied over 2D data, in this case our 2D features (MFCCs, delta and acceleration). A 2D kernel slides over the two-dimensional data, calculates an output at each position and feeds it to the next layer.
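The “takes a group of contiguous data, applies a calculation and outputs a value” idea behind Conv1D can be sketched in plain Python. The f0 values and the kernel are invented here; in a real Conv1D layer the kernel weights are learned, and there are usually many kernels plus a bias and an activation.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence and output one value per position."""
    width = len(kernel)
    return [
        sum(value * weight for value, weight in zip(sequence[i:i + width], kernel))
        for i in range(len(sequence) - width + 1)
    ]


# Hypothetical f0 contour (one value per time frame) and a hand-written
# kernel that measures the change across three consecutive frames.
f0 = [100, 100, 110, 130, 130, 120]
kernel = [-1, 0, 1]
print(conv1d(f0, kernel))  # [10, 30, 20, -10]
```

Note that the output is shorter than the input, because the kernel only visits positions where it fits entirely. Conv2D does the same thing with a small 2D window moving over both dimensions.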
Once again, take this as an oversimplification of what these layers really do. There are a lot more people with expertise in this field that will teach you better than I’ll ever do.
Hyperparameters
Such a flashy and cool-sounding word. Remember the weights and biases used in the neurons? Those are parameters. They’ll be fine-tuned as a result of the training process. Meanwhile, hyperparameters are adjusted manually. By who? By you. Or your team. But it’s more likely that if you’re reading this, you’re a solo developer on your own personal journey.
Some hyperparameters you’ll hear about are learning rate, epochs, batch size, activation function, optimizer, loss function. Let’s break down each one of them and we’re done. Promise.
Learning rate
As you know, the model learns by training. You input some values, the model processes them, and it compares the predicted output to the expected output. Learning happens when the error between both values is calculated and the model works out how to reduce it by adjusting its parameters. Have you ever played Minecraft? If you have ever used eyes of ender to search for the stronghold, you know they give you a direction, not a distance. You decide how big a step you’ll take each time you throw an eye. If you walk too little, you’ll spend lots of time (and eyes) reaching your destination, but if you walk too much, you might overshoot the stronghold and have to throw another eye, missing again. Adjusting the learning rate is deciding between taking cautious small steps or bold large leaps to reach your goal. And yes, you can adjust the learning rate during training.
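Here is the stronghold hunt as a toy example: plain gradient descent on f(x) = x², where the “stronghold” is at x = 0. The function, the starting point, and the three learning rates are picked only to show the effect, not taken from the model.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimise f(x) = x**2, whose gradient is 2 * x and whose minimum is 0."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                  # the "direction" the eye gives you
        x = x - learning_rate * gradient  # how far you walk in that direction
    return x


for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, gradient_descent(learning_rate))
```

With the tiny rate, x has barely moved from 10 after 20 steps; with 0.1 it ends up close to 0; with 1.1 every step overshoots further than the last one and x runs away from the goal.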
Epochs
This is the number of times the whole dataset will be fed into the model for training. More epochs usually mean better results, but too many can lead to overfitting: the model might become too comfy at predicting the training data and then won’t perform well on new data (during inference).
Batch size
This is the number of samples in a batch, that is, a group of samples/sequences that is passed to the model together during training; after each batch, the parameters are adjusted. Here you have to aim for a balance too. Bigger batches are more efficient to compute, but the resulting model often isn’t as good as with smaller ones, and smaller batches are in turn slower to train.
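How epochs and batch size fit together is easier to see as a loop. This sketch does no actual training; the ten fake samples and the chosen numbers are placeholders, and it only counts how many times the parameters would be adjusted.

```python
def iterate_batches(samples, batch_size):
    """Yield the dataset in consecutive groups of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for 10 training samples
epochs, batch_size = 3, 4

updates = 0
for epoch in range(epochs):
    for batch in iterate_batches(samples, batch_size):
        # A real training loop would compute the loss on this batch here
        # and adjust the parameters once.
        updates += 1

print(updates)  # 9: three batches (4 + 4 + 2 samples) per epoch, three epochs
```

In Keras you don’t write this loop yourself: both values are passed as the `epochs` and `batch_size` arguments of `model.fit`.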
Activation function
I wanted to avoid maths as much as possible but here we go.
An activation function is a function like any other: you input a value (the result of a neuron’s weighted sum) and it outputs a different value. The intention behind it is to avoid linearity in the model. Without activation functions, the whole model would collapse into a linear one, no matter how many layers it had. There are different activation functions, like ReLU, softmax, tanh, etc. You can search more about them on your own. They’re not as complex as they might sound.
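The “collapse” is easy to check by hand. In the sketch below, two stacked one-neuron layers without an activation behave exactly like a single neuron, and putting a ReLU in between breaks that. The weights and biases are arbitrary numbers chosen for the example; softmax is included because it is the function that turns scores into probabilities.

```python
import math


def relu(x):
    return max(0.0, x)


def softmax(values):
    """Turn a list of scores into probabilities that add up to 1."""
    exps = [math.exp(v - max(values)) for v in values]  # shifted for stability
    return [e / sum(exps) for e in exps]


# Two "layers" of one neuron each, with hypothetical weights and biases.
def layer1(x):
    return 2.0 * x + 1.0


def layer2(x):
    return -3.0 * x + 0.5


def stacked_linear(x):
    """No activation in between the two layers."""
    return layer2(layer1(x))


def single_linear(x):
    """One neuron that does exactly the same: -3 * (2x + 1) + 0.5."""
    return -6.0 * x - 2.5


def stacked_relu(x):
    """Same two layers, with a ReLU in between."""
    return layer2(relu(layer1(x)))


for x in (-2, -1, 0, 1, 2):
    print(x, stacked_linear(x), single_linear(x), stacked_relu(x))
```

The second and third columns always match, while the last one flattens out for negative inputs, which is something no single straight line can do.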
Optimizer
This is the part responsible for training the model. It takes the (insert math terms) gradients of the loss and, based on them, indicates the “direction” in which every parameter has to be adjusted (using the learning rate we talked about before).
Loss function
This is the function in charge of calculating the difference between the expected result and the predicted one. The right loss function depends on the task the model has to perform. You could even write your own, but yeah, you definitely need a lot more understanding of both your goal and maths to do so.
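As one example, categorical cross-entropy is a common loss for one-hot labels like the ones described earlier. I am using it here as an illustration, not as a statement of what the model ends up using; the three predictions are invented.

```python
import math


def categorical_cross_entropy(expected, predicted, epsilon=1e-12):
    """Loss for one time frame: expected is one-hot, predicted is probabilities."""
    # epsilon avoids log(0) when a predicted probability is exactly 0.
    return -sum(e * math.log(p + epsilon) for e, p in zip(expected, predicted))


expected = [0, 0, 0, 1, 0]  # the frame is labeled with phoneme index 3

predictions = {
    "confident": [0.01, 0.01, 0.01, 0.96, 0.01],
    "unsure": [0.2, 0.2, 0.2, 0.2, 0.2],
    "wrong": [0.96, 0.01, 0.01, 0.01, 0.01],
}

for name, predicted in predictions.items():
    print(name, round(categorical_cross_entropy(expected, predicted), 3))
```

The loss is smallest when the model puts most of its confidence on the right phoneme, larger when it hedges, and largest when it is confidently wrong.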
As promised, we’re done! This could be considered a continuation of the introduction, most useful if you’re not familiar with machine learning and neural network concepts.
Like always, I invite you to keep researching on your own if you’re really interested in the topic. What I explained can be used as a general guideline of what everything does, since my goal is for you to understand what I’ve been doing and plan to do, but never to teach you what a neural network is or the maths and logic behind it.
I suggest 3Blue1Brown’s series (again), FreeCodeCamp’s Deep Learning Crash Course and StatQuest’s series too. If you speak Spanish (or understand, at least), I suggest DotCSV’s series.
The next story in the journal will definitely be about the model itself. Code, results and more of my suffering. Promise.
Hugs, Yeww.