# Preliminaries¶

**Convolutional Neural Network**: Convolutional Neural Network (CNN) is
a class of neural networks, and has been proven to be effective for most
computer vision tasks. In a CNN architecture for image classification,
there are usually three important components: the convolutional layer,
the pooling layer and the fully connected layer. The first two types are
designed to extract high-level features from images, and the fully
connected layer can be used as a classifier to output the classification
results. In a convolutional neural network, the convolutional and fully
connected layers are equipped with parameters called *weight* and
*bias*, which will be updated during training. In this work, we
implemented these components along with other necessary components, such
as activations (ReLu function), losses (Cross-Entropy Loss), etc.

**Deconvolution**: The two fundamental components of the CNN are
convolutional layers and pooling layers, which works together to
transform images into feature maps. Deconvolutional operation is a
transformation that goes in the opposite direction of a normal
convolution operation, i.e. from the feature maps that we extracted with
convolution operation to something that has the shape of the input to
it. After being introduced, deconvolution has been used in many fields
such as pixel-wise segmentation, generative models, etc. In this work,
we use deconvolution to map the intermediate feature maps back to the
input pixel space. With this approach, we can show the relationship
between the input patterns and the activations in the feature maps.
There are also two components in the approach, called deconvolutional
layers and unpooling layers, and we will explain these two concepts in
more detail in *Section 3 Approach*.

**Stochastic Gradient Descent**: In neural networks, we want to find a
function \(\hat{y}=F(x)\) such that \(y-F(x)\) is minimal. The
function that we are looking for is usually non-linear, non-convex and
there is generally no formula for it. As a result, gradient descent
becomes one of the most popular methods to find the local minimum of the
function. The method is based on a fact that the function \(f\)
decreases fastest along the direction of the negative gradient.
Formally, we can define a function that measures the difference between
\(\hat{y}\) and \(y\), for example \(f=y-\hat{y}\) and
assume that \(a\) is the only parameter in \(f\). Then if we let
\(a_{n+1}=a_n-\epsilon\nabla_af\) and we want to find the lowest
value of \(f(a)\) around the point \(a\). Then if
\(\epsilon\) is small enough, we will have
\(f(a_{n+1})\leq f(a_n)\) and \(f(a_{n+1})\) is the smallest
value around a small enough interval of \(a\). Considering this, if
we want to find the local minimal of the function \(f\), we can
start at a random point \(a_0\), and follows the negative direction
of the gradient. With this approach, we will have a sequence
\(a_1, a_2,\cdots a_n\) that satisfies
\(a_{n+1}=a_n-\epsilon\nabla_af\). Then the output of the function
\(f\) will satisfy the rule that
\(f(a_n)\leq f(a_{n-1})\leq\cdots \leq f(a_{0})\). By doing so, we
could find an approximate value \(a_n\) such that \(f(a_n)\) is
the local minimal.

**Backpropagation**: In the process of gradient descent, we found that
we need to compute the gradient of our function \(f\) in every step.
Backpropagation, as an application of the chain rule, is an efficient
algorithm for calculating the gradient in deep neural networks. In
short, it first computes the gradient of the loss function to the weight
of the last layer in a neural network, and passes the gradient of the
loss function to the input of the layer to previous layers. There are
two bases for the algorithm:

In the \(i\)th layer, we can receive the gradient of the loss \(\ell\) with respects to the output of \(i\)th layer, i.e. \(\frac{\partial \ell}{\partial \hat{y}^{(i)}}\) is known to us.

Since the output of \((i-1)\)th layer is the input of the \(i\)th layer, we have \(\frac{\partial \ell}{\partial x^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{i-1}}\)

Having these two bases, we could compute the gradient of the loss \(\ell\) with respect to the weight and input of every layer by applying chain rules. For example, \(\frac{\partial \ell}{\partial w^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{(i)}}\frac{\partial \hat{y}^{(i)}}{\partial w^{(i)}}\) where we only need to know how to compute \(\hat{y}^{(i)}\) with \(w^{(i)}\). With these, we could efficiently compute the loss value with respect to every parameter in the neural network and update them with the SGD method.

**Numerical Differentiation** Besides manually working out the
derivatives, we can also estimate the derivatives with numerical
approximation. numerical differentiation is an algorithm for estimating
the derivative of a mathematical function using the values of the
function.

The simplest method, also known as Newton’s differentiation quotient is by using the finite difference approximations. More specifically, if we have a function \(f(x)\) and we want to compute the derivative of \(f\), we can approximate it by computing the slope of a nearby secant line through the points \((x, f(x))\) and \((x+\epsilon, f(x+\epsilon))\). The slope of this secant line will be \(\frac{f(x+\epsilon)-f(x)}{\epsilon}\), and the derivative is the tangent line at the point \(x\). As \(\epsilon\) approaches \(0\), the slope of the secant line approaches the slope of the tangent line. Therefore, the true derivative of \(f\) at the point \(x\) can be defined as \(f'(x)=\lim_{\epsilon\to0}\frac{f(x+\epsilon)-f(x)}{\epsilon}\). Then by this nature, we could manually choose a small enough \(\epsilon\), and to approximately approach the tangent line, i.e. the derivative of the function \(f\) at \(x\).

As we know the point \(x+\epsilon\) is at the right of \(x\), the form \(\frac{f(x+\epsilon)-f(x)}{\epsilon}\) is called right-sided form. Besides this form, we can also approach the tangent line from the left side and right side (the two-sided form) at the same time. To do so, we compute the slope of a nearby secant line through the points \((x-\epsilon, f(x-\epsilon))\) and \((x+\epsilon, f(x+\epsilon))\) by \(\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}\). This form is a more accurate approximation to the tangent line than the one-sided form and therefore we will use the two-sided form in the following sections.

In order to verify that we are working out the derivatives correctly, we
will involve the numerical differentiation as a way of cross-validation,
and increase our confidence in the correctness of induction. In *3.1.1
Fully Connected*, we will show how to perform the numerical
differentiation in a concrete and simple example.