## Notes on Deep Learning with R Chapter 2

Chapter 2 - The Mathematical Building Blocks of Neural Networks

These are my notes on Chapter 2 of *Deep Learning with R* by Chollet and Allaire.

# Intro to Neural Networks

In machine learning, a *category* in a classification problem is called a *class*. Data points are called *samples*, and the class associated with a sample is called its *label*. The network learns to classify by using *training* samples to predict their classes. The actual classes of the training data are known; we use them, together with a *loss function*, an *optimizer*, and *metrics* to monitor, to train our network (specifying these is called the *compilation step*). The network is then applied to *test* samples to predict classes it has not seen. We can check the accuracy (or any other metric) of our predictions on both the test and training samples. If predictions on the training samples are significantly better than on the test samples, we have *overfit* our network: its construction is too reliant on the training data.

Each layer of a network extracts *representations* of the data fed into it. Many simple layers are chained together to refine the data and finally make a prediction.

# Data Representation

Most machine learning models use *tensors* as their data structure. “Tensors are a generalization of vectors and matrices to an arbitrary number of dimensions”. A scalar is a 0-dimensional tensor, a vector is 1-dimensional, and a matrix is 2-dimensional. A tensor has three key attributes:

- Number of axes (rank)
- Shape
- Data type
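
A minimal illustration of these three attributes, using Python/NumPy arrays as a stand-in for the R tensor code in the book:

```python
import numpy as np

# A rank-3 tensor: 2 matrices of shape 3 x 4 stacked along a new axis
x = np.arange(24, dtype="float32").reshape(2, 3, 4)

print(x.ndim)   # number of axes (rank): 3
print(x.shape)  # shape: (2, 3, 4)
print(x.dtype)  # data type: float32
```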

# Tensor Operations

Element-wise operations are applied independently to each element of the tensors. One example is element-wise addition of two tensors (with the same dimensions). This is highly parallelizable.

Some operations combine elements such as a dot product. This can return an output tensor with different dimensions from the input tensors.

Other operations change the order of elements while leaving the total number unchanged. One example is reshaping a tensor.

Neural networks are built of layers of simple tensor operations.
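
The three kinds of operations above can be sketched in Python/NumPy (a stand-in for the book's R code):

```python
import numpy as np

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[10., 20.], [30., 40.]])

# Element-wise: applied independently to each element; shapes must match
elementwise_sum = a + b        # shape (2, 2), same as the inputs

# Dot product: combines elements; the output shape can differ from the inputs
v = np.array([1., 1.])
dot = a @ v                    # shape (2,): each row of a dotted with v

# Reshape: reorders elements while leaving the total number unchanged
flat = a.reshape(4)            # shape (4,), still 4 elements
```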

# Gradient-Based Optimization

A neural network is trained over many iterations to improve its performance. On each iteration, it updates the *weights* of each layer. These are initially set randomly and then adjusted (the *learning* in “machine learning”) using the training data and a feedback signal. This iterative process is called the *training loop*; it has the following steps.

- Draw a batch of training samples `x` and corresponding targets `y`.
- Run the network on `x` (a step called the *forward pass*) to obtain predictions `y_pred`.
- Compute the loss of the network on the batch, a measure of the mismatch between `y_pred` and `y`.
- Update all weights of the network in a way that slightly reduces the loss on the batch.
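
The steps above can be sketched as minibatch SGD for a toy linear model (a Python/NumPy stand-in for the book's R code; names like `w` and `lr` are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + noise, so the ideal weight is 3
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=256)

w = rng.normal(size=1)   # weights start random
lr = 0.1                 # the small "step" adjustment factor

for _ in range(100):
    # 1. Draw a random batch of samples x and targets y
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx, 0], y[idx]
    # 2. Forward pass: obtain predictions y_pred
    y_pred = w[0] * xb
    # 3. Loss: mean squared mismatch between y_pred and yb
    loss = np.mean((y_pred - yb) ** 2)
    # 4. Update the weight in a way that slightly reduces the loss
    grad = np.mean(2 * (y_pred - yb) * xb)
    w[0] -= lr * grad

# After training, w has moved close to the true value of 3
```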

Since the samples in the first step are drawn randomly, this procedure is called *minibatch stochastic gradient descent*.

The fourth step is the complicated one. Adjusting the weights one at a time would be very inefficient. Instead, we take the *gradient* of the loss with respect to the weights, evaluate it at our current weights, and use it to adjust all of the weights at once.

```
updated_weights = weights - step * gradient(loss(weights))
```

Here the loss function measures the difference between our predictions and our expected values, so we subtract in order to update the weights in the direction that makes this difference smaller. We are hoping to reach a point where the gradient is 0 (i.e., a minimum of the loss function). The `step` is a small adjustment factor, needed because the gradient is only a local approximation of the loss surface around `weights`, so we do not want to move too far away from that point.
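
A concrete sketch of the update rule on a one-dimensional loss, `loss(w) = (w - 2)**2`, whose gradient is `2 * (w - 2)` (my toy example, not the book's):

```python
# loss(w) = (w - 2)**2 has its minimum (gradient 0) at w = 2
def gradient(w):
    return 2 * (w - 2)

w = 10.0
step = 0.1
for _ in range(50):
    w = w - step * gradient(w)  # updated_weights = weights - step * gradient

# w has moved close to the minimum at 2
```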

One issue that arises with this method of optimization is local minima. To avoid getting stuck at a local minimum, we can add *momentum*, which incorporates previous weight adjustments into the current adjustment.
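
One common formulation of momentum keeps a running "velocity" that blends each previous adjustment into the current one (a sketch on the same toy loss as above; the coefficient 0.9 is a typical choice, not taken from the book):

```python
def gradient(w):
    return 2 * (w - 2)   # gradient of loss(w) = (w - 2)**2

w, velocity = 10.0, 0.0
step, momentum = 0.1, 0.9
for _ in range(200):
    # The previous adjustment (velocity) carries into the current one,
    # which can push the weights through shallow local minima
    velocity = momentum * velocity - step * gradient(w)
    w = w + velocity

# w ends up close to the minimum at 2
```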

Notice that taking the gradient of a multilayered network requires the chain rule (as each layer's output is the input to the subsequent layer). This is implemented efficiently using *backpropagation*.
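
A minimal chain-rule sketch for a two-"layer" composition `loss = f(g(w))` (my toy functions, not the book's): the backward pass multiplies local derivatives together, from the output back toward the weights.

```python
# Two chained "layers": h = g(w) = 3*w, then y = f(h) = h**2
w = 2.0
h = 3 * w          # forward through layer g
y = h ** 2         # forward through layer f

# Backward pass: propagate gradients from the output back to the input
dy_dh = 2 * h      # local gradient of f at h
dh_dw = 3          # local gradient of g at w
dy_dw = dy_dh * dh_dw  # chain rule: d/dw (3w)**2 = 18w, i.e. 36 at w = 2
```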