Neural Network Notes
These notes are mostly derived from the Neural Networks From Scratch series found here and the accompanying textbook, both of which I would highly recommend in order to learn about NNs (Neural Networks).
Short summary of Neural Networks from the series
Basic description of NNs
Neural networks are basically hyper-complex math functions that map an arbitrary number of inputs to an arbitrary number of outputs. For example, the pixel values of a photo could map to two output neurons that represent cat or not cat.
Networks consist of layers of neurons: the input layer, the output layer, and the hidden layers in between. There can be as many hidden layers as deemed necessary.
Each neuron in a fully connected layer takes the output from each neuron in the previous layer and multiplies it by a weight. These values are then all summed and a bias is added.
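The weighted-sum-plus-bias just described can be sketched in plain Python (the input, weight and bias values here are arbitrary example numbers, not from the series):

```python
# A single neuron: multiply each input by its weight, sum, then add the bias.
inputs = [1.0, 2.0, 3.0]       # outputs from the previous layer
weights = [0.2, 0.8, -0.5]     # one weight per incoming connection
bias = 2.0

output = sum(i * w for i, w in zip(inputs, weights)) + bias
print(output)  # 2.3
```

A fully connected layer is just many of these neurons, each with its own weight list and bias, all reading the same inputs.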
Surface level overview of training
Creating this math function involves initialising the weights and biases to random values and then testing a dataset on the network. The point of this is to calculate the loss, or how wrong the network is. Using this, an optimiser can change the weights and biases of each neuron to hopefully get a more correct answer.
Networks are usually trained on batches of around 32 samples at a time, which allows the optimiser to pick up a more general trend from the data. If it were trained on just one sample at a time it would over-correct for each sample and would be less likely to find the patterns that apply to the data generally.
Activation functions
NNs as currently described are flawed, however, as there needs to be a means of adding non-linearity to the network or else it can only map to linear functions.
This is where the activation function comes in: a function applied to the output of a neuron after the inputs have been multiplied by the weights, summed and the bias added. Two or more hidden layers (with non-linear activations) are required to map to non-linear problems. A different activation function, such as Softmax, is typically applied to the output layer.
One such activation function is the step function, which outputs 0 for x <= 0 and 1 for x > 0. This is not considered desirable as information is essentially lost by limiting the value to either one or zero.
Another is the sigmoid function, which maps the input to a scale between 0 and 1 and allows for more granularity. However, it introduces a problem in training the network called the vanishing gradient problem, where the gradient of the error function becomes increasingly small. This prevents the weights of neurons from changing value, effectively stopping training. There is a similar but opposite problem called the exploding gradient problem.
One of the most popular (currently the most popular) is the Rectified Linear Unit function, or ReLU for short. It is simply max(x, 0). That's it. This mitigates the vanishing gradient issue of the sigmoid function and is efficient to calculate, among other benefits.
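The three activation functions mentioned so far can be sketched with NumPy (function names are my own):

```python
import numpy as np

def step(x):
    # 1 for x > 0, else 0 -- granularity is lost.
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Simply max(x, 0), applied element-wise.
    return np.maximum(x, 0.0)

x = np.array([-2.0, 0.0, 3.0])
print(step(x))       # [0. 0. 1.]
print(sigmoid(0.0))  # 0.5
print(relu(x))       # [0. 0. 3.]
```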
The Softmax function takes the inputs and maps them to a probability distribution, which is just a fancy way of saying that they all add up to one but stay in proportion, i.e. the largest output will be the largest portion of the sum that adds to one. The assumption here is that the "most activated" output neuron is the answer that the network is giving. In a perfect answer one neuron would be one and the others zero, meaning the network has full confidence in the answer it is giving. The ReLU function is not used on the last layer as it would be difficult to interpret meaning from the output: each value would either be zero or some positive number that is unbounded. Hence, some way to compare the outputs relative to each other is needed.
To solve this problem the Softmax function can be used, which will be derived here. To remove negative numbers from the output the exponential function can be used: y = e^x. This means that increasingly negative numbers will increasingly approach zero when applied. Next, the values need to be normalised, or in plain English converted to the probability distribution mentioned before, where the output values all add to one. This is done by dividing each value by the sum of all values, e.g. norm([1, 2]) = [1/(1+2), 2/(1+2)] = [1/3, 2/3]. The combination of these two steps of exponentiation and normalisation is the Softmax function. A limitation of this naive implementation is that it is prone to numerical overflow due to the exponentiation step, as relatively small inputs such as 1000 will overflow the exponential and crash the program. To combat this the max value of the outputs is subtracted from every value, keeping the same proportions relative to each other but limiting the values fed to the exponential to (-inf, 0]. This doesn't change the final output but does prevent the crash from occurring.
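The two steps above, plus the max-subtraction trick, can be sketched like this (function name is my own):

```python
import numpy as np

def softmax(outputs):
    # Subtract the max so the largest exponent is e**0 = 1,
    # preventing overflow without changing the final proportions.
    shifted = outputs - np.max(outputs)
    exp = np.exp(shifted)       # step 1: remove negatives
    return exp / np.sum(exp)    # step 2: normalise to sum to one

# 1000 would overflow a naive e**x, but is handled fine here.
probs = softmax(np.array([1.0, 2.0, 1000.0]))
print(probs.sum())  # 1.0 -- a valid probability distribution
```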
Calculating loss
When training a NN it might be intuitive to just say that the output is either correct or wrong. However, with a simple binary output like this, a lot of useful information about how wrong the NN is gets thrown away. Hence why a probability distribution is used on the output neurons instead of a simple binary one or zero. For instance, if the NN is only 50% confident on a two-output network, it clearly has no idea what the correct answer is. This information can be used to tell the optimiser that the values need to be heavily adjusted.
A method of calculating the total loss or error is Mean Absolute Error, where the absolute (positive) differences between each given answer and the correct answer are averaged over a batch of samples. In this instance, a perfect network would have a loss of zero as there is no difference between the output and the correct answer. Another such function is Mean Squared Error. These are both useful for networks that do regression, i.e. finding a relationship in a dataset if one exists, or more simply "fitting a curve". The example given in the series is fitting a NN to a sine function, where a single input value into the NN generates a single output value that maps onto the sine wave.
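Both regression losses can be sketched with NumPy (function names and the example values are my own):

```python
import numpy as np

def mean_absolute_error(predicted, target):
    # Average of the absolute differences; zero for a perfect network.
    return np.mean(np.abs(predicted - target))

def mean_squared_error(predicted, target):
    # Average of the squared differences; punishes large errors more.
    return np.mean((predicted - target) ** 2)

pred = np.array([0.9, 0.2, 0.4])
true = np.array([1.0, 0.0, 0.5])
print(mean_absolute_error(pred, true))  # roughly 0.1333
print(mean_squared_error(pred, true))   # roughly 0.02
```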
For determining what is inside a picture, for example, a different loss function is needed: one that can relate probability distributions, as that is what classification NNs output.
Categorical Cross-Entropy loss function
The typical choice for NNs doing classification with a Softmax activation function on the outputs is Categorical Cross-Entropy. This function simply gives a loss value to be used, so it is not strictly necessary to understand the mechanics of how it works. It is good to know, however, in order to better understand why it is effective for calculating loss and training a NN.
Currently for calculating loss there are two vectors: the vector containing the outputs of the NN, and the vector containing the correct answer from the training dataset, also known as the ground-truth vector. These two vectors represent two probability distributions, and cross-entropy measures the "distance" between them: the number of additional bits of information required to "encode" one distribution as the other.
The ground-truth vector will be a one-hot vector: a vector containing a single one denoting the correct answer as a 100% confidence value, with the rest being zeroes denoting the incorrect answers.
To perform the full equation, the log of each output of the NN is multiplied with the corresponding value in the ground-truth vector (as they are vectors of the same size), the results summed, and the sum multiplied by -1. However, because the ground-truth is a one-hot vector, this simplifies to the negative log of the output neuron that holds the correct answer.
One thing to note: due to the nature of the logarithm, the lower the confidence in the correct answer, the larger the loss, growing towards infinity as the confidence approaches zero.
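The full equation and its one-hot simplification can be sketched like so (the clipping step is my own addition, to guard against taking log(0)):

```python
import numpy as np

def categorical_cross_entropy(predicted, one_hot_target):
    # Full form: -(sum of target * log(prediction)).
    # Clip so a zero-confidence output never produces log(0) = -inf.
    predicted = np.clip(predicted, 1e-7, 1 - 1e-7)
    return -np.sum(one_hot_target * np.log(predicted))

softmax_output = np.array([0.7, 0.2, 0.1])
ground_truth = np.array([1.0, 0.0, 0.0])  # one-hot: class 0 is correct

# The one-hot zeroes wipe out every term except the correct class,
# so this equals -log(0.7).
print(categorical_cross_entropy(softmax_output, ground_truth))  # roughly 0.3567
```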
Notes from the Book
Each title refers to a chapter of the book and contains the notes I’ve made for that chapter.
A brief history
What each sensor measures in this example is called a feature. A group of features makes up a feature set (represented as vectors/arrays), and the values of a feature set can be referred to as a sample. Samples are fed into neural network models to train them to fit desired outputs from these inputs, or to predict based on them during the inference phase.
Each neuron has a set of inputs which have a linear function attached to them.
output = sum(input * weights) + bias
Where input and weights are N-dimensional arrays, with N depending on how many inputs there are to the neuron. The bias is a scalar number added to the output sum of the neuron.
Data fed into an NN is usually preprocessed in a way so that each of the input values are between 0 and 1 or -1 and 1.
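One common way to do this preprocessing is min-max scaling, sketched here (the book may use a different scheme; the function name and values are my own):

```python
import numpy as np

def min_max_scale(x):
    # Linearly rescale values into the range [0, 1].
    return (x - x.min()) / (x.max() - x.min())

pixels = np.array([0.0, 64.0, 128.0, 255.0])  # e.g. 8-bit pixel values
scaled = min_max_scale(pixels)
print(scaled)  # every value now between 0 and 1
```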
NNs are trained by slowly adjusting the weights and biases based on the error of the NN which is also called Loss. The idea is that eventually the NN can generalise to out-of-sample data and correctly predict the desired result.
Coding our first neurons
A list that is used as an input is called a feature set but is also called a feature set instance, observation or most commonly a sample.
Batches are used to train a NN in parallel, which is faster, but they also stop the NN from over-correcting as it would if one sample were used at a time. Batching more gradually shifts the NN towards a set of weights and biases that fit the entire dataset.
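A batched, fully connected layer as described in this chapter can be sketched with NumPy (the specific numbers here are arbitrary example values):

```python
import numpy as np

# A batch of 3 samples, each with 4 features.
inputs = np.array([[1.0, 2.0, 3.0, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]])

# A fully connected layer of 3 neurons: one weight row per neuron.
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])
biases = np.array([2.0, 3.0, 0.5])

# One matrix product computes every neuron for every sample at once.
layer_outputs = np.dot(inputs, weights.T) + biases
print(layer_outputs.shape)  # (3, 3): 3 samples x 3 neurons
```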
Glossary
- Activation function: A function usually applied to the output of a neuron. Serves to constrain the output somehow but also to introduce non-linearity, such that neural networks of two or more hidden layers can map to non-linear functions.
- Backpropagation: An algorithm used to calculate the loss gradient for a feed-forward neural network. This allows the use of gradient methods for training neural networks in order to minimise loss.
- Classification: Determining which category (class) a sample belongs to, e.g. cat or not cat, typically output as a probability distribution over the classes.
- Deep neural net: A Neural Network consisting of two or more hidden layers.
- Ground-truth: The known correct answer for a sample from the training dataset, represented for classification as a one-hot vector.
- Hidden layer: A layer in a Neural Network between the input and output layers
- Labels: The correct answers attached to the samples in a dataset, used as the targets when training.
- Overfitting: When a Neural Network fits the training data too well and doesn't learn the relationship between the input and output, akin to memorising the training data, meaning that the network cannot generalise to new data.
- Regression: Finding a relationship in a dataset if one exists, or more simply "fitting a curve", e.g. mapping an input value to a point on a sine wave.
- Supervised learning: Training a network on samples paired with their correct answers (labels), adjusting the weights and biases based on the loss.
- Unsupervised learning: Training a network on samples without labels, leaving it to find structure or patterns in the data on its own.