In the sections below, we will by default use the following notation:

  • \(x_j^{(i)}\) is the \(j\)th input of the \(i\)th layer. In fully connected layers, all the inputs form a vector, so we can use \(x_j^{(i)}\) to denote them. However, in convolutional layers, the inputs form a 2-D matrix, and in that case we will use \(x_{kj}^{(i)}\) to denote an individual input. The resulting vectors (in fully connected layers) or matrices (in convolutional layers) will be represented by \(X^{(i)}\).

  • \(\hat{y}_j^{(i)}\) is the \(j\)th output of the \(i\)th layer. If the \(i\)th layer is the last layer, there may be corresponding ground truths. In this case, the ground truths will be denoted by \(y_j^{(i)}\). We will use \(\hat{Y}^{(i)}\) to denote the vector or matrix that contains all the outputs of the \(i\)th layer. Correspondingly, the vector or matrix of the ground truths will be denoted by \(Y^{(i)}\).

  • \(\ell\) is the value of the loss function. In our classification model, we use the cross-entropy loss, defined as \(\ell = -\sum_{j=1}^{n}y_j^{(i)} \log(\hat{y}_{j}^{(i)})\), where \(i\) is the index of the last layer and \(n\) is the number of outputs. In regression models, the loss can be defined as \(\ell=\frac{1}{n}\sum_{j=1}^{n}(\hat{y}_j-y_j)^2\) (the mean squared error). In the examples below, we will by default use the mean squared error, as it is easier to compute.
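For concreteness, both losses can be computed in a few lines of plain Python. This is only an illustrative sketch; the function names `cross_entropy` and `mse` are ours, not part of any model code discussed here:

```python
import math

def cross_entropy(y, y_hat):
    # l = -sum_j y_j * log(y_hat_j), with y the ground truths
    # and y_hat the predicted outputs of the last layer
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

def mse(y, y_hat):
    # l = (1/n) * sum_j (y_hat_j - y_j)^2
    n = len(y)
    return sum((p - t) ** 2 for t, p in zip(y, y_hat)) / n

# one-hot ground truth [0, 1, 0] with prediction [0.1, 0.8, 0.1]:
# cross-entropy reduces to -log(0.8) ~ 0.2231
ce = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
# mse([1, 2], [1.5, 2.5]) = (0.25 + 0.25) / 2 = 0.25
m = mse([1.0, 2.0], [1.5, 2.5])
```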

  • \(\nabla_{X}f\) refers to the derivative of a function \(f\) with respect to \(X\), i.e. \(\nabla_{X}f=\frac{\partial f}{\partial X}\). In the following sections, we will use \(\nabla^{(i)}\) to represent the gradient of the loss value with respect to the output of the \(i\)th layer. Formally, we have \(\nabla^{(i)}=\frac{\partial\ell}{\partial \hat{Y}^{(i)}}\). Since the output of the \(i\)th layer is also the input of the \((i+1)\)th layer, we also have \(\nabla^{(i)}=\frac{\partial\ell}{\partial X^{(i+1)}}\). In case \(\nabla^{(i)}\) is a matrix, we will use \(\nabla^{(i)}_{kj}\) to denote its \((k,j)\)th element.
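As a small worked instance of this notation: for the mean squared error at the last layer \(i\), differentiating \(\ell=\frac{1}{n}\sum_{j}(\hat{y}_j-y_j)^2\) with respect to each \(\hat{y}_j\) gives \(\nabla^{(i)}_j=\frac{2}{n}(\hat{y}_j-y_j)\). A minimal Python sketch (the name `mse_grad` is illustrative):

```python
def mse_grad(y, y_hat):
    # gradient of the mean squared error w.r.t. each predicted output:
    # d/d(y_hat_j) [ (1/n) * sum_j (y_hat_j - y_j)^2 ] = 2 * (y_hat_j - y_j) / n
    n = len(y)
    return [2.0 * (p - t) / n for t, p in zip(y, y_hat)]

# with y = [1, 2] and y_hat = [1.5, 2.5], each component is 2 * 0.5 / 2 = 0.5
grad = mse_grad([1.0, 2.0], [1.5, 2.5])  # -> [0.5, 0.5]
```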

  • There are some other notations we need in the following sections: \(\mathbb{R}\) denotes the set of real numbers, \(\mathbb{R}^{m\times n}\) denotes the set of \(m\times n\) real matrices, and \(\epsilon\) denotes a sufficiently small real number.
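One place a small \(\epsilon\) commonly appears is in numerically approximating a gradient by central differences, \(\frac{\partial f}{\partial x_j}\approx\frac{f(x+\epsilon e_j)-f(x-\epsilon e_j)}{2\epsilon}\), which is useful for checking analytic gradients. A minimal sketch, assuming the function takes a plain list of numbers (the name `numerical_grad` is ours):

```python
def numerical_grad(f, x, eps=1e-6):
    # central-difference approximation of df/dx_j for each component j
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

# f(x) = x0^2 + 3*x1 has exact gradient (2*x0, 3); at (2, 5) that is (4, 3)
g = numerical_grad(lambda x: x[0] ** 2 + 3 * x[1], [2.0, 5.0])
```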