Preliminaries
Convolutional Neural Network: A Convolutional Neural Network (CNN) is a class of neural networks that has proven effective for most computer vision tasks. A CNN architecture for image classification usually contains three important components: the convolutional layer, the pooling layer and the fully connected layer. The first two are designed to extract high-level features from the image, while the fully connected layer acts as a classifier that outputs the classification result. The convolutional and fully connected layers carry parameters, called weights and biases, which are updated during training. In this work, we implemented these components along with other necessary parts, such as activations (the ReLU function) and losses (the cross-entropy loss).
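As a rough illustration of how these components fit together, the sketch below runs a single-channel image through a convolution, a ReLU, a 2x2 max pooling and a fully connected classifier. It is a minimal NumPy sketch with hypothetical function names, not the implementation developed in this work.

```python
import numpy as np

def conv2d(x, w, b):
    """Valid convolution of a single-channel image x with a k x k kernel w."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w) + b
    return out

def max_pool2x2(x):
    """2x2 max pooling with stride 2 (assumes even spatial dimensions)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def relu(x):
    """ReLU activation, applied element-wise."""
    return np.maximum(0.0, x)

def fully_connected(x, w, b):
    """Linear classifier layer: logits = W x + b."""
    return w @ x + b

def softmax_cross_entropy(logits, label):
    """Softmax over the classes followed by the cross-entropy loss."""
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[label] + 1e-12), probs

# Tiny forward pass on a random 8x8 "image" with 3 classes.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
feat = max_pool2x2(relu(conv2d(image, kernel, 0.0)))   # 3x3 feature map
flat = feat.reshape(-1)
W, b = rng.normal(size=(3, flat.size)), np.zeros(3)
loss, probs = softmax_cross_entropy(fully_connected(flat, W, b), label=1)
print(loss, probs)
```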
Deconvolution: The two fundamental components of a CNN are the convolutional layers and the pooling layers, which work together to transform images into feature maps. The deconvolution operation is a transformation that goes in the opposite direction of a normal convolution, i.e. from the feature maps extracted by a convolution back to something that has the shape of that convolution's input. Since its introduction, deconvolution has been used in many areas, such as pixel-wise segmentation and generative models. In this work, we use deconvolution to map the intermediate feature maps back to the input pixel space, which lets us show the relationship between the input patterns and the activations in the feature maps. The approach likewise has two components, called deconvolutional layers and unpooling layers, which we explain in more detail in Section 3 Approach.
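The sketch below only illustrates the shape-reversing nature of the operation for a single channel: a transposed ("deconvolution") pass scatters each activation back through the kernel, producing an output with the spatial size of the original input. The function names are hypothetical, and the actual deconvolutional and unpooling layers used in this work are described in Section 3.

```python
import numpy as np

def conv2d(x, w):
    """Valid convolution of a single-channel image x with a k x k kernel w."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def deconv2d(feat, w):
    """Transposed ('deconvolution') operation: scatters each activation of the
    feature map back through the kernel, recovering the input's spatial size."""
    k = w.shape[0]
    out = np.zeros((feat.shape[0] + k - 1, feat.shape[1] + k - 1))
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            out[i:i + k, j:j + k] += feat[i, j] * w
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))
feature_map = conv2d(image, kernel)          # 4x4 feature map
reconstruction = deconv2d(feature_map, kernel)
print(reconstruction.shape)                  # (6, 6): back in input pixel space
```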
Stochastic Gradient Descent: In neural networks, we want to find a function \(\hat{y}=F(x)\) such that the difference between \(y\) and \(F(x)\) is minimal. The function we are looking for is usually non-linear and non-convex, and there is generally no closed-form solution, so gradient descent has become one of the most popular methods for finding a local minimum. The method is based on the fact that a function \(f\) decreases fastest along the direction of its negative gradient. Formally, we define a function that measures the difference between \(\hat{y}\) and \(y\), for example \(f=(y-\hat{y})^2\), and assume for simplicity that \(a\) is the only parameter of \(f\). Starting from a point \(a_n\), we take the step \(a_{n+1}=a_n-\epsilon\nabla_af\); if \(\epsilon\) is small enough, we have \(f(a_{n+1})\leq f(a_n)\). Therefore, to find a local minimum of \(f\), we can start at a random point \(a_0\) and repeatedly follow the negative direction of the gradient. This produces a sequence \(a_1, a_2,\cdots, a_n\) satisfying \(a_{n+1}=a_n-\epsilon\nabla_af\) and \(f(a_n)\leq f(a_{n-1})\leq\cdots \leq f(a_{0})\), so we obtain a value \(a_n\) such that \(f(a_n)\) is approximately a local minimum. In stochastic gradient descent (SGD), the gradient at each step is estimated on a mini-batch of training examples rather than on the whole dataset, which makes each update cheaper while following the same rule.
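A minimal sketch of this update rule on a toy one-parameter objective (the quadratic objective and the chosen step size are illustrative assumptions, not the setup used later in this work):

```python
# Gradient-descent sketch on f(a) = (y - a * x)^2 for a single (input, target) pair.
x, y = 2.0, 6.0                      # the optimal parameter is a = 3
f = lambda a: (y - a * x) ** 2
grad_f = lambda a: -2.0 * x * (y - a * x)   # df/da worked out by hand

a = 0.0                              # starting point a_0
epsilon = 0.05                       # step size (learning rate)
for n in range(50):
    a = a - epsilon * grad_f(a)      # a_{n+1} = a_n - epsilon * grad f(a_n)
print(a, f(a))                       # a approaches 3, f(a) approaches 0
```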
Backpropagation: In gradient descent we need to compute the gradient of our function \(f\) at every step. Backpropagation, an application of the chain rule, is an efficient algorithm for calculating these gradients in deep neural networks. In short, it first computes the gradient of the loss with respect to the weights of the last layer, and then passes the gradient of the loss with respect to that layer's input back to the previous layers. The algorithm relies on two facts:
In the \(i\)th layer, we receive the gradient of the loss \(\ell\) with respect to the output of the \(i\)th layer, i.e. \(\frac{\partial \ell}{\partial \hat{y}^{(i)}}\) is known to us.
Since the output of the \((i-1)\)th layer is the input of the \(i\)th layer, we have \(\frac{\partial \ell}{\partial x^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{(i-1)}}\).
With these two facts, we can compute the gradient of the loss \(\ell\) with respect to the weights and the input of every layer by applying the chain rule. For example, \(\frac{\partial \ell}{\partial w^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{(i)}}\frac{\partial \hat{y}^{(i)}}{\partial w^{(i)}}\), where we only need to know how \(\hat{y}^{(i)}\) is computed from \(w^{(i)}\). In this way we can efficiently compute the gradient of the loss with respect to every parameter in the neural network and update the parameters with the SGD method, as in the sketch below.
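The following sketch traces these two facts through a tiny two-layer linear network with a squared-error loss. The shapes, names and choice of loss are illustrative assumptions, not the layers implemented in this work.

```python
import numpy as np

# Backpropagation sketch for a two-layer linear network y_hat(i) = W(i) x(i) + b(i).
rng = np.random.default_rng(0)
x1 = rng.normal(size=3)                      # input of layer 1
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = np.array([1.0, -1.0])                    # target

# Forward pass: the output of layer 1 is the input of layer 2.
y1 = W1 @ x1 + b1                            # \hat{y}^{(1)} = x^{(2)}
y2 = W2 @ y1 + b2                            # \hat{y}^{(2)}
loss = 0.5 * np.sum((y2 - y) ** 2)

# Backward pass: start from d loss / d y2 and apply the chain rule layer by layer.
d_y2 = y2 - y                                # d loss / d y_hat(2)
d_W2 = np.outer(d_y2, y1)                    # d loss / d W(2)
d_b2 = d_y2                                  # d loss / d b(2)
d_y1 = W2.T @ d_y2                           # passed back: d loss / d x(2) = d loss / d y_hat(1)
d_W1 = np.outer(d_y1, x1)                    # d loss / d W(1)
d_b1 = d_y1                                  # d loss / d b(1)

# The parameters can now be updated with the gradient-descent rule from above,
# e.g. W2 -= epsilon * d_W2.
```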
Numerical Differentiation: Besides manually working out the derivatives, we can also estimate them by numerical approximation. Numerical differentiation is an algorithm for estimating the derivative of a mathematical function using only values of the function.
The simplest method, also known as Newton's difference quotient, uses finite difference approximations. More specifically, if we have a function \(f(x)\) and we want to compute its derivative, we can approximate it by computing the slope of a nearby secant line through the points \((x, f(x))\) and \((x+\epsilon, f(x+\epsilon))\). The slope of this secant line is \(\frac{f(x+\epsilon)-f(x)}{\epsilon}\), while the derivative is the slope of the tangent line at the point \(x\). As \(\epsilon\) approaches \(0\), the slope of the secant line approaches the slope of the tangent line. Therefore, the true derivative of \(f\) at the point \(x\) can be defined as \(f'(x)=\lim_{\epsilon\to0}\frac{f(x+\epsilon)-f(x)}{\epsilon}\). By choosing a small enough \(\epsilon\), we can thus approximate the slope of the tangent line, i.e. the derivative of the function \(f\) at \(x\).
Since the point \(x+\epsilon\) lies to the right of \(x\), the form \(\frac{f(x+\epsilon)-f(x)}{\epsilon}\) is called the right-sided (one-sided) form. Alternatively, we can approach the tangent line from the left and the right at the same time (the two-sided form) by computing the slope of a nearby secant line through the points \((x-\epsilon, f(x-\epsilon))\) and \((x+\epsilon, f(x+\epsilon))\), which is \(\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}\). The two-sided form is a more accurate approximation of the tangent line than the one-sided form (its error shrinks on the order of \(\epsilon^2\) rather than \(\epsilon\)), and therefore we use the two-sided form in the following sections, as illustrated below.
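A quick sketch comparing the two forms on a function whose derivative is known exactly (the choice of \(f=\sin\) and of \(\epsilon\) is only for illustration):

```python
import numpy as np

# One-sided vs. two-sided finite differences for f(x) = sin(x), whose derivative is cos(x).
f = np.sin
x, eps = 1.0, 1e-4

one_sided = (f(x + eps) - f(x)) / eps                # right-sided form
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)    # two-sided form
true_grad = np.cos(x)

print(abs(one_sided - true_grad))   # error on the order of eps
print(abs(two_sided - true_grad))   # error on the order of eps**2 (much smaller)
```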
To verify that we are working out the derivatives correctly, we use numerical differentiation as a form of cross-validation, which increases our confidence in the correctness of our derivations. In Section 3.1.1 Fully Connected, we show how to perform numerical differentiation in a concrete and simple example.