# Neural Network from scratch: Part 1 – Theory

Welcome! In this tutorial we will go through all the details of neural networks so we can build our own in C++ from scratch!

# Why?

Imagine for a moment you want to tell a computer how to recognize a bird. How would you do it? The most common answer would be to check the shape, the size, the colors. Does it have a beak? Does it have wings? etc. Suppose you can describe all these features with a bunch of *if* conditions. What if the picture is now taken from a different angle? What if the picture is zoomed in or out? What if the bird's color changed? What if the bird opened its wings? If you really wanted to do it this way, you would have to stack endless *if else* conditions, the program would be for a single type of bird, and it would probably still not work. There are endless possibilities! It is simply not possible this way.

A bird recognition problem may not seem that interesting, but it belongs to a larger category which includes:

- image recognition
- speech recognition
- character recognition
- and many others...

These problems are all **classification problems**, i.e. given a certain input, the machine should select the category to which it belongs (for instance: bird, dog, cat... for images). These problems seem extremely simple to us, but they turn out to be among the most complicated to implement on a computer. The reason for their complexity is mainly that you can't describe them easily. Instead of trying to hardcode the problem, we are going to give samples to the computer and tell it *"These are birds. Find the pattern."* The program is going to **learn from the examples** we're giving it, and then should be able to tell whether a picture it has never seen before represents a bird or not. This is called **Machine Learning.**

**N.B.** The artificial intelligence we are going to code will only be able to solve classification problems. But the very same techniques that we are going to see are used in other machine learning algorithms. If you understand this tutorial you should be able to apply what you learned elsewhere.

# Machine Learning

The machine is going to learn the pattern itself. But how? The answer is **Neural Networks**, NN for short. They are at the core of most machine learning algorithms, and probably the most used technique today!

The first idea was to make a model inspired by the way our visual cortex works. After all, we're the result of millions of years of evolution! I'm going to show you this model, but you have to keep in mind that it is a very poor model (compared to the way our brain really works), and to say "neural networks work like your brain" would actually be pretty false.

# Artificial neuron

This is called an **artificial neuron**. A neural network is made up of these (hence the name). As you can see, it is not very complicated, but there are some things to emphasize:

- This is one neuron. Alone it is not effective, but when you stack a bunch of them together you can start to learn things.
- The neuron has one or several inputs and **only one** output.
- The $x_i$ are the inputs of the neuron. They could be:
  - pixel values, for example, if the neuron is at the beginning of the network
  - outputs of other neurons connected to this one
- The $w_i$ (for weight) and $b$ (for bias) are the **parameters** of the neuron (the values we will have to find).
- The weights quantify how important each input is in the final output.
- The bias acts just like an offset (since the weights multiply the inputs), and it helps the network to be trained faster.
- The output is a **function** of the weighted sum of the inputs plus the bias:

$$y = f(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$$

Let's talk about this function. It is commonly called the **activation function** and it adds a **non-linearity** to the output. If the function weren't there, a whole pack of neurons connected to each other would be equivalent to a single neuron with specific weights, and therefore a simple weighted sum as an output. Basically, a neural network is just trying to **find a function that fits a given set of points**; if the output is linear, it means that a simple line equation fits your data, which is not the case in the problems we discussed earlier, thus we need the nonlinear property.

There are several activation functions in use, but historically the first one was the **sigmoid** function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It's not the literal formula that is the most important, it's the **graph**. The function is **asymptotic at 0 and 1**, and it has a nice smooth shape, so it is **differentiable**. Now you may wonder: why do we need it to be differentiable? Keep in mind that we want to find the right values of our **weights** and **bias** so that the **output of the neuron matches the desired output**. To do this, we could randomly change the values of $w_i$ and $b$ and see how the neuron behaves, but it would take too much time; there are endless possibilities, especially when your network is dense.

The clever way is to write down a **loss function** (for example, desired output minus actual output) and calculate the derivative of the loss function with respect to the parameters. Basically, the result of this calculation tells you how the error is going to change if we change the parameters. So we don't actually have to change them to know; more importantly, we know how to change them to decrease the error! (The derivative of a function is positive where the function increases and negative where it decreases.) Therefore, if we want to decrease the loss function, which means **desired_output - actual_output = 0** and hence **desired_output = actual_output**, all we need to do is follow the derivative downhill toward the minimum! This process is called **gradient descent**.

**Vocabulary:** In mathematics, when dealing with functions of one variable, we talk about the **"derivative"**; when dealing with functions of several variables, we talk about the **"gradient"**. The gradient is a **vector** and it points toward the direction that increases the function as quickly as possible.

**P.S.** The fact that the output of the sigmoid function lies between 0 and 1 is a nice property because we can consider the output of every neuron as a **probability**.

## Matrix

One last thing before we really dive into the math! Right now the output of a neuron is:

$$y = f(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$$

This notation is not very convenient. There is a very powerful mathematical tool which we are going to use instead: the matrix.

If:

$$W = \begin{pmatrix} w_1 & w_2 & \dots & w_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Then:

$$y = f(W \cdot X + b)$$

Where $f(M)$ means that $f$ is applied to every element of $M$.

**Reminder 1** Let $A$ and $B$ be two matrices. The dot product $A \cdot B$ can be calculated if and only if the number of columns of $A$ is equal to the number of rows of $B$.

**Reminder 2** Let $A$ be a matrix of size $m \times n$ and $B$ a matrix of size $n \times p$, then the dot product $A \cdot B$ is of size $m \times p$.

# Neural Network

As I said earlier, a single neuron cannot do much, but added together they can solve great problems! We are going to learn how to make a multi-layer neural net. Here is what it looks like from the "neurons" perspective:

The network is split into three parts called **layers**.

- The **input layer** takes the input data (image pixels, for example). This layer contains as many neurons as there are inputs (pixels in our case).
- The **output layer** gives the result of the network. This layer contains as many neurons as there are output categories.
- The **hidden layer** is everything in between the input and output layers. You can have more than one, and the number of neurons in these layers depends on the problem; it is experimental.

The neuron layers in the image above are called **Fully-Connected Layers** because every neuron of a layer is connected to every neuron of the previous and the next layer.

## Forward propagation

We will call the input, hidden and output neurons respectively $X$, $H$ and $O$, the input<->hidden weights and hidden<->output weights respectively $W_1$ and $W_2$, and the hidden and output biases respectively $B_1$ and $B_2$:

$X$ is of size $i \times 1$

$W_1$ is of size $h \times i$

$B_1$ is of size $h \times 1$

$W_2$ is of size $o \times h$

$B_2$ is of size $o \times 1$

Where,

$i =$ input neuron count

$h =$ hidden neuron count

$o =$ output neuron count

$$H = f(W_1 \cdot X + B_1) \qquad (1)$$

$$O = f(W_2 \cdot H + B_2) \qquad (2)$$

At this point we have calculated what we call the **forward propagation**, in other words the output. Note that every dot product we're doing is dimensionally valid. The next step is **backpropagation**.

## Backpropagation

Backpropagation is the reverse path. This is when we update the parameters, otherwise they will never change and the network will output the same result every time.

We need to calculate the derivative of the loss function with respect to $W_1$, $W_2$, $B_1$ and $B_2$. But first we have to write down the loss function:

$$E = \frac{1}{2}(Y - O)^2 \qquad (3)$$

This function is called the **squared error loss function**, where $Y$ is the desired output.

You probably wonder: why is there a power of 2? And why is there a factor $\frac{1}{2}$?

Since we are going to follow the slope of the function (the derivative) to find the minimum of the loss (which means actual output = desired output), we don't want to be stuck in a local minimum, but to reach the global minimum. Take a look at this random function:

Here we want to be in the second minimum, which seems to be the global one. For this reason, we add a power of 2 to the loss function, because $x^2$ is a convex function (one minimum). As for the $\frac{1}{2}$, it simply cancels with the power of 2 when we differentiate the loss. Now we can calculate the derivative of the loss function with respect to the parameters. We'll start with the second bias.

$$\frac{\partial E}{\partial B_2} = (O - Y) \odot f'(W_2 \cdot H + B_2) \qquad (4)$$

with:

$$f'(x) = f(x)\,(1 - f(x))$$

Keep in mind that the result of $\frac{\partial E}{\partial B_2}$ has to be of the same dimension as $B_2$ (since we want the derivative of the loss function with respect to each element of $B_2$). Therefore, the product above, written $\odot$, is an element-wise multiplication (element by element). Now the second weight matrix.

$$\frac{\partial E}{\partial W_2} = \left[(O - Y) \odot f'(W_2 \cdot H + B_2)\right] \cdot H^T \qquad (5)$$

The result of $\frac{\partial E}{\partial W_2}$ has to be of the same dimension as $W_2$, therefore we have to take the **transposed** matrix of $H$ and compute a **dot product**. The transpose operation is a kind of flip: each element of coordinates $(i, j)$ becomes the element $(j, i)$. Therefore, it transforms a matrix of size $m \times n$ into a matrix of size $n \times m$.

$$\frac{\partial E}{\partial B_1} = \left[W_2^T \cdot \left((O - Y) \odot f'(W_2 \cdot H + B_2)\right)\right] \odot f'(W_1 \cdot X + B_1) \qquad (6)$$

$$\frac{\partial E}{\partial W_1} = \left[\left(W_2^T \cdot \left((O - Y) \odot f'(W_2 \cdot H + B_2)\right)\right) \odot f'(W_1 \cdot X + B_1)\right] \cdot X^T \qquad (7)$$

That's it! Equations (4), (5), (6) and (7) are our four final equations for backpropagation.

## Update parameters

This is the last piece of the puzzle. So far we calculated the output of the network, then we calculated the derivative of the loss function with respect to the parameters so we know how to change the parameters to reduce the loss. The final step is to actually change the parameters.

$$W_1 \leftarrow W_1 - \eta \frac{\partial E}{\partial W_1} \qquad W_2 \leftarrow W_2 - \eta \frac{\partial E}{\partial W_2} \qquad B_1 \leftarrow B_1 - \eta \frac{\partial E}{\partial B_1} \qquad B_2 \leftarrow B_2 - \eta \frac{\partial E}{\partial B_2}$$

Where $\eta$ is an arbitrary constant called the **learning rate**. Okay, but why do we subtract the derivative? Here is a little graph to understand:

Since our loss function is convex like $x^2$, it's okay to make an analogy with this graph. As you can see, when a parameter $p$ is to the left of the minimum, $\frac{\partial E}{\partial p} < 0$, and when $p$ is to the right of the minimum, $\frac{\partial E}{\partial p} > 0$. Hence:

- if $\frac{\partial E}{\partial p} < 0$ then $-\eta \frac{\partial E}{\partial p} > 0$, resulting in pushing $p$ to the right of the graph
- if $\frac{\partial E}{\partial p} > 0$ then $-\eta \frac{\partial E}{\partial p} < 0$, resulting in pushing $p$ to the left of the graph

In both cases we apply a change to $p$ so that the loss $E$ goes toward its minimum.

That's it! All we need to do now is iterate through this process and the error should decrease!

In the next part we will implement our neural network in C++ to make a handwritten character recognition AI!