Back Propagation
In this section, we will use the fully connected neural network shown below.

Illustration of the simple neural network we will be using.
We will use the following notation to demonstrate the process:
\(x_i\) is the \(i\)-th input of the neural network. Our example has two inputs, \(x_1=0.25\) and \(x_2=0.65\), and the input-layer bias \(x_0=0.30\).
\(h_i^{(j)}\) is the value of the \(i\)-th node in the \(j\)-th layer. The weights that the \(j\)-th layer multiplies with its input are denoted by \(w_{pq}^{j}\), where \(p\) indexes the node in the \((j-1)\)-th layer and \(q\) indexes the node in the \(j\)-th layer. For example, the weight between \(x_1\) and \(h_1^{(1)}\) is denoted by \(w^{1}_{11}\). We assume that \(w^{1}_{11}=0.20, w^{1}_{12}=0.25, w^{1}_{21}=0.30, w^{1}_{22}=0.35\) and \(w^{2}_{11}=0.50, w^{2}_{12}=0.55, w^{2}_{21}=0.60, w^{2}_{22}=0.65\). In addition, we use \(b_j\) to denote the bias added to the nodes of the \(j\)-th layer: the first layer uses the input bias, \(b_1=x_0=0.30\), and the output layer uses the single bias node of the hidden layer, \(b_2=h_0^{(1)}=0.20\).
\(y_i\) is the \(i\)-th predicted output, and \(\hat{y_i}\) is the corresponding ground truth. In our case, we have \(\hat{y_1}=0.99\) and \(\hat{y_2}=0.01\).
In this example, we will use the mean squared error over the two outputs, i.e. \(l=\frac{1}{2}\sum_{i=1}^{2}(y_i-\hat{y_i})^2\). We will use stochastic gradient descent (SGD) to optimize the weights, with the learning rate \(\lambda\) set to \(0.1\). We apply a ReLU activation after every layer.
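With these parameters, we can perform the first forward pass. For the hidden layer:
\[h_1^{(1)} = x_1 * w^{1}_{11} + x_2 * w^{1}_{21} + x_0 = 0.25 * 0.20 + 0.65 * 0.30 + 0.30 = 0.545\]
\[h_2^{(1)} = x_1 * w^{1}_{12} + x_2 * w^{1}_{22} + x_0 = 0.25 * 0.25 + 0.65 * 0.35 + 0.30 = 0.59\]
After the ReLU activation these values are unchanged, since they are all positive. Then, for the output layer:
\[y_1 = h_1^{(1)} * w^{2}_{11} + h_2^{(1)} * w^{2}_{21} + h_0^{(1)} = 0.545 * 0.50 + 0.59 * 0.60 + 0.20 = 0.8265\]
\[y_2 = h_1^{(1)} * w^{2}_{12} + h_2^{(1)} * w^{2}_{22} + h_0^{(1)} = 0.545 * 0.55 + 0.59 * 0.65 + 0.20 = 0.88325\]
These values are also unchanged by the ReLU activation, since they are all positive.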
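The forward pass can be checked numerically with a minimal NumPy sketch. It does not use tinyml; the names `W1`, `b1`, `W2`, `b2` and `relu` are chosen here for illustration, and the weight matrices are assumed to be laid out so that entry `[p, q]` holds \(w_{(p+1)(q+1)}\).

```python
import numpy as np

# Inputs and initial parameters from the worked example.
x = np.array([0.25, 0.65])
W1 = np.array([[0.20, 0.25],
               [0.30, 0.35]])   # W1[p, q] connects input p to hidden node q
b1 = np.array([0.30, 0.30])     # input-layer bias x_0, one copy per hidden node
W2 = np.array([[0.50, 0.55],
               [0.60, 0.65]])   # W2[p, q] connects hidden node p to output q
b2 = np.array([0.20, 0.20])     # hidden-layer bias h_0^(1), one copy per output


def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0.0)


h = relu(x @ W1 + b1)   # hidden layer, approximately [0.545, 0.59]
y = relu(h @ W2 + b2)   # outputs, approximately [0.8265, 0.88325]
print(h, y)
```

Because every pre-activation is positive, the ReLU leaves the values unchanged, as described above.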
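Then we can compute the loss for this iteration: \(l=\frac{1}{2}\sum_{i=1}^{2}(y_i-\hat{y_i})^2=\frac{1}{2}\left[(0.8265-0.99)^2+(0.88325-0.01)^2\right]=0.39465\). Then the gradient of the loss with respect to the outputs will be
\[\frac{\partial l}{\partial y_1}=y_1-\hat{y_1}=0.8265-0.99=-0.1635\]
\[\frac{\partial l}{\partial y_2}=y_2-\hat{y_2}=0.88325-0.01=0.87325\]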
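After this, we want to compute the derivative of the loss with respect to the weights and bias in the output layer. Since \(y_q\) depends on \(w^2_{pq}\) only through the term \(h^{(1)}_p * w^2_{pq}\), and the ReLU derivative is 1 because both outputs are positive, we have \(\frac{\partial l}{\partial w^2_{pq}}=\frac{\partial l}{\partial y_q}\frac{\partial y_q}{\partial w^2_{pq}}=\frac{\partial l}{\partial y_q} * h^{(1)}_p\), so
\[\frac{\partial l}{\partial w^2_{11}}=-0.1635 * 0.545=-0.0891075, \quad \frac{\partial l}{\partial w^2_{12}}=0.87325 * 0.545=0.47592125\]
\[\frac{\partial l}{\partial w^2_{21}}=-0.1635 * 0.59=-0.096465, \quad \frac{\partial l}{\partial w^2_{22}}=0.87325 * 0.59=0.5152175\]
For the bias, \(\frac{\partial l}{\partial h^{(1)}_0}=\frac{\partial l}{\partial y_q}\), which gives \(-0.1635\) for the bias feeding \(y_1\) and \(0.87325\) for the bias feeding \(y_2\).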
Then the gradient that we will pass to the previous layer, i.e. \(\frac{\partial l}{\partial h^{(1)}_{i}}\), can be computed as below:
\(\frac{\partial l}{\partial h^{(1)}_{1}}=\frac{\partial l}{\partial y_1}\frac{\partial y_1}{\partial h^{(1)}_{1}} + \frac{\partial l}{\partial y_2}\frac{\partial y_2}{\partial h^{(1)}_{1}}=-0.1635 * w^2_{11} + 0.87325 * w^2_{12}=-0.1635 * 0.50 + 0.87325 * 0.55 =0.3985375\)
\(\frac{\partial l}{\partial h^{(1)}_{2}}=\frac{\partial l}{\partial y_1}\frac{\partial y_1}{\partial h^{(1)}_{2}} + \frac{\partial l}{\partial y_2}\frac{\partial y_2}{\partial h^{(1)}_{2}}=-0.1635 * w^2_{21} + 0.87325 * w^2_{22}=-0.1635 * 0.60 + 0.87325 * 0.65 =0.4695125\)
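The backward step through the output layer can be sketched in NumPy as follows. This is an illustrative check rather than tinyml's implementation; the variable names are assumptions, and the hidden activations and outputs are hard-coded from the forward pass so that the block runs on its own.

```python
import numpy as np

h = np.array([0.545, 0.59])         # hidden activations from the forward pass
y = np.array([0.8265, 0.88325])     # predicted outputs
y_true = np.array([0.99, 0.01])     # ground truth
W2 = np.array([[0.50, 0.55],
               [0.60, 0.65]])       # W2[p, q] connects hidden node p to output q

loss = 0.5 * np.sum((y - y_true) ** 2)   # approximately 0.39465
grad_y = y - y_true                      # dl/dy, approximately [-0.1635, 0.87325]

# Gradients of the output-layer parameters.
grad_W2 = np.outer(h, grad_y)   # grad_W2[p, q] = h_p * dl/dy_q
grad_b2 = grad_y.copy()         # the bias enters each output with coefficient 1

# Gradient passed back to the hidden layer: sum over outputs q of dl/dy_q * W2[p, q].
grad_h = grad_y @ W2.T          # approximately [0.3985375, 0.4695125]
print(loss, grad_W2, grad_h)
```

The transpose in `grad_y @ W2.T` reflects that each hidden node contributes to both outputs, so its gradient sums the contributions from \(y_1\) and \(y_2\).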
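With these derivatives, we can update the weights and bias in the output layer using the learning rate \(\lambda = 0.1\) and the rule \(w \leftarrow w - \lambda\frac{\partial l}{\partial w}\). Then
\[w^2_{11} = 0.50 - 0.1 * (-0.0891075) = 0.50891075, \quad w^2_{12} = 0.55 - 0.1 * 0.47592125 = 0.502407875\]
\[w^2_{21} = 0.60 - 0.1 * (-0.096465) = 0.6096465, \quad w^2_{22} = 0.65 - 0.1 * 0.5152175 = 0.59847825\]
The bias feeding \(y_1\) becomes \(0.20 - 0.1 * (-0.1635) = 0.21635\), and the bias feeding \(y_2\) becomes \(0.20 - 0.1 * 0.87325 = 0.112675\).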
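The SGD update for the output layer can be sketched as a one-line rule. The helper `sgd_step` below is hypothetical (it is not tinyml's `SGDOptimizer`), and the gradient values are hard-coded from the derivation above so that the block is self-contained.

```python
import numpy as np


def sgd_step(param, grad, lr=0.1):
    """Return the parameter after one plain SGD step: param - lr * grad."""
    return param - lr * grad


# Output-layer parameters and their gradients from the worked example.
W2 = np.array([[0.50, 0.55],
               [0.60, 0.65]])
b2 = np.array([0.20, 0.20])
grad_W2 = np.array([[-0.0891075, 0.47592125],
                    [-0.096465, 0.5152175]])
grad_b2 = np.array([-0.1635, 0.87325])

W2_new = sgd_step(W2, grad_W2)
b2_new = sgd_step(b2, grad_b2)
print(W2_new)
print(b2_new)
```

Weights connected to \(y_1\) increase because their gradient is negative (the prediction was too low), while weights connected to \(y_2\) decrease.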
So far, we have updated the parameters in the output layer; next, we need to repeat the same steps for each previous layer in turn. To keep the computation clean, we will switch to matrix notation. We already have the gradient that the output layer passes to the hidden layer, denoted by \(\nabla_2 = [\frac{\partial l}{\partial h^{(1)}_{1}}, \frac{\partial l}{\partial h^{(1)}_{2}}]=[0.3985375, 0.4695125]\). Then
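\(\frac{\partial l}{\partial w_{ij}^{1}}=\frac{\partial l}{\partial h^{(1)}_{j}}\frac{\partial h^{(1)}_{j}}{\partial w_{ij}^{1}}=\frac{\partial l}{\partial h^{(1)}_{j}} * x_i\), where the ReLU derivative is 1 because both hidden values are positive. Thus we will have
\[\begin{bmatrix}\frac{\partial l}{\partial w^1_{11}} & \frac{\partial l}{\partial w^1_{12}}\\ \frac{\partial l}{\partial w^1_{21}} & \frac{\partial l}{\partial w^1_{22}}\end{bmatrix}=\begin{bmatrix}x_1\\ x_2\end{bmatrix}\nabla_2=\begin{bmatrix}0.25\\ 0.65\end{bmatrix}\begin{bmatrix}0.3985375 & 0.4695125\end{bmatrix}=\begin{bmatrix}0.099634375 & 0.117378125\\ 0.259049375 & 0.305183125\end{bmatrix}\]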
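Then for the bias, we have \(\frac{\partial l}{\partial x_0}=\frac{\partial l}{\partial h^{(1)}_j}\), because the bias is added to each hidden node with coefficient 1. Therefore, writing \(x_{0j}\) for the bias added to the \(j\)-th hidden node, \([\frac{\partial l}{\partial x_{01}},\frac{\partial l}{\partial x_{02}}]=[0.3985375, 0.4695125]\).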
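With these gradients, we can update the parameters in the hidden layer. Writing \(W^1\) for the matrix of weights \(w^1_{ij}\),
\[W^1 \leftarrow \begin{bmatrix}0.20 & 0.25\\ 0.30 & 0.35\end{bmatrix}-0.1 * \begin{bmatrix}0.099634375 & 0.117378125\\ 0.259049375 & 0.305183125\end{bmatrix}=\begin{bmatrix}0.1900365625 & 0.2382621875\\ 0.2740950625 & 0.3194816875\end{bmatrix}\]
\[[x_{01}, x_{02}] \leftarrow [0.30, 0.30]-0.1 * [0.3985375, 0.4695125]=[0.26014625, 0.25304875]\]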
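The hidden-layer step can be sketched in NumPy in the same way. The names below are illustrative assumptions, and the incoming gradient \(\nabla_2\) is hard-coded from the text. The ReLU mask is written out explicitly even though it is all ones in this example.

```python
import numpy as np

x = np.array([0.25, 0.65])                  # network inputs
h = np.array([0.545, 0.59])                 # hidden activations (all positive)
grad_h = np.array([0.3985375, 0.4695125])   # nabla_2, passed back by the output layer
W1 = np.array([[0.20, 0.25],
               [0.30, 0.35]])
b1 = np.array([0.30, 0.30])
lr = 0.1

# ReLU passes the gradient through only where the activation is positive.
grad_z1 = grad_h * (h > 0)

grad_W1 = np.outer(x, grad_z1)   # grad_W1[i, j] = x_i * dl/dh_j
grad_b1 = grad_z1                # the bias enters each node with coefficient 1

W1_new = W1 - lr * grad_W1
b1_new = b1 - lr * grad_b1
print(W1_new)
print(b1_new)
```

If a hidden value were zero or negative, the corresponding column of `grad_W1` would be zero and those weights would not move in this step.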
At this point, we have computed a forward pass and a backward pass, and updated all the parameters in the neural network. Running the forward pass again with the updated parameters gives the new loss \(l=0.2384542802724947\), down from \(0.39465\). If the new loss is still unsatisfactory, we can repeat the process to lower the loss until it is satisfactory.
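The whole iteration can be reproduced end to end with a short NumPy sketch, which serves as a check on the hand computation. It is independent of tinyml; the function and variable names are assumptions made for this example, and each node is given its own copy of the layer bias.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0.0)


def forward(x, W1, b1, W2, b2):
    """Two fully connected layers, each followed by ReLU."""
    h = relu(x @ W1 + b1)
    y = relu(h @ W2 + b2)
    return h, y


def loss_fn(y, y_true):
    """Half the sum of squared errors, as defined in the text."""
    return 0.5 * np.sum((y - y_true) ** 2)


x = np.array([0.25, 0.65])
y_true = np.array([0.99, 0.01])
W1 = np.array([[0.20, 0.25], [0.30, 0.35]])
b1 = np.array([0.30, 0.30])
W2 = np.array([[0.50, 0.55], [0.60, 0.65]])
b2 = np.array([0.20, 0.20])
lr = 0.1

# Forward pass and loss before the update.
h, y = forward(x, W1, b1, W2, b2)
loss_before = loss_fn(y, y_true)

# Backward pass (ReLU masks are all ones here, but are kept for generality).
grad_y = (y - y_true) * (y > 0)
grad_W2 = np.outer(h, grad_y)
grad_b2 = grad_y
grad_h = (grad_y @ W2.T) * (h > 0)   # uses W2 before it is updated
grad_W1 = np.outer(x, grad_h)
grad_b1 = grad_h

# One SGD step on every parameter.
W1, b1 = W1 - lr * grad_W1, b1 - lr * grad_b1
W2, b2 = W2 - lr * grad_W2, b2 - lr * grad_b2

# Forward pass with the updated parameters.
_, y_new = forward(x, W1, b1, W2, b2)
loss_after = loss_fn(y_new, y_true)
print(loss_before, loss_after)
```

Wrapping the backward pass and update in a loop would repeat the iteration until the loss is acceptable.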
The process can be easily implemented in tinyml with the following code snippets.
```python
import numpy as np

import tinyml
from tinyml.layers import Linear, ReLu
from tinyml.losses import mse_loss
from tinyml.net import Sequential
from tinyml.optims import SGDOptimizer

# Set tinyml's logger verbosity.
tinyml.utilities.logger.VERBOSE = 3

# One training sample: inputs x and ground truth y, each of shape (1, 2).
x = np.array([0.25, 0.65]).reshape(1, 2)
y = np.array([0.99, 0.01]).reshape(1, 2)
print(x.shape)
print(y.shape)

# Two fully connected layers and one ReLU activation.
fc_1 = Linear('fc_1', 2, 2)
fc_2 = Linear('fc_2', 2, 2)
relu = ReLu('relu_1')

# Overwrite the initial parameters with the values from the worked example.
fc_1.weight = np.array([0.20, 0.25, 0.30, 0.35]).reshape(2, 2)
fc_1.bias = np.array([0.30, 0.30])
fc_1._rebuild_params()
fc_2.weight = np.array([0.50, 0.55, 0.60, 0.65]).reshape(2, 2)
fc_2.bias = np.array([0.20, 0.20])
fc_2._rebuild_params()
print(fc_2.weight.tensor)

model = Sequential([fc_1, relu, fc_2])
optimizer = SGDOptimizer(0.1)  # learning rate 0.1
model.summary()

epochs = 1
for epoch in range(epochs):
    # Forward pass, then the loss and its gradient w.r.t. the prediction.
    y_predicted = model.forward(x)
    loss, loss_gradient = mse_loss(y_predicted, y)
    print('>>>>')
    print(loss)
    print(loss_gradient)
    print('<<<<')
    # Backward pass, then one SGD step on every parameter.
    model.backward(loss_gradient)
    model.update(optimizer)
```