# Fully Connected Layers

**Definition** The fully connected layer is the simplest and most
fundamental neural network component: it connects every node in one
layer to every node in the next layer. For example, the figure below shows a simple fully connected network in which
yellow nodes represent the biases (nodes that are
independent of the output of the previous layer) in each layer, green
nodes represent the inputs to the neural network, blue nodes represent
the intermediate activations and red nodes represent the outputs.

From the definition of a fully connected layer, we know that if the \(i\)th layer has \(n^{(i)}\) input nodes (in our figure, \(n^{(0)}=2\) (the green nodes) and \(n^{(1)}=2\) (the blue nodes)) and \(m^{(i)}\) output nodes (in our figure, \(m^{(0)}=2\) (the blue nodes) and \(m^{(1)}=2\) (the red nodes)), there will be \(m^{(i)}\times n^{(i)}\) relations between them. These relations can be represented as an \(m^{(i)}\times n^{(i)}\) matrix \(w^{(i)}\) (the weight). Besides these relations, there will be \(m^{(i)}\) biases, which can be represented as an \(m^{(i)}\)-dimensional vector \(b^{(i)}\) (the bias). In our figure, \(b^{(0)}\) has \(2\) entries (\(x_0\) to \(h_1^{(1)}\) and \(x_0\) to \(h_2^{(1)}\)) and \(b^{(1)}\) has \(2\) entries (\(h_0^{(1)}\) to \(\hat{y}_1\) and \(h_0^{(1)}\) to \(\hat{y}_2\)).

**Forward Pass** With the above notation, we can directly compute the
forward pass. For example, in the figure, we can compute
\(h_1^{(1)}\) as
\(h_1^{(1)}=x_1w^{(1)}_{11}+x_2w^{(1)}_{21}+x_0\) where
\(w^{(i)}_{jk}\) is the relation between \(x_j\) and
\(h^{(i)}_{k}\), and the \(x_0\) term is the bias contribution to
\(h_1^{(1)}\). In matrix form, we have \(\hat{Y}=Xw+b\) where
\(w\) is the weight matrix, \(X\) is the input vector and
\(b\) is the bias vector. In this form, \(X\) is an
\(n\)-dimensional row vector, \(w\) is an \(n\times m\) matrix (so that
the product \(Xw\) is defined) and \(b\) is an \(m\)-dimensional vector.
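As a minimal sketch of the matrix form above, assuming NumPy and a hypothetical helper name `linear_forward` (neither is prescribed by the text), the forward pass for a single input vector or a batch of row vectors could look like this:

```python
import numpy as np


def linear_forward(X, w, b):
    """Compute Y_hat = X w + b for X of shape (n,) or (batch, n).

    w has shape (n, m) and b has shape (m,), matching the matrix form above.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    if X.shape[-1] != w.shape[0] or w.shape[1] != b.shape[0]:
        raise ValueError("shape mismatch between X, w and b")
    return X @ w + b


# Illustrative values only (not taken from the text): n = 3 inputs, m = 2 outputs.
X_demo = np.array([1.0, 2.0, 3.0])
w_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
b_demo = np.array([0.5, -0.5])
y_demo = linear_forward(X_demo, w_demo, b_demo)
print(y_demo)
```

The shape check makes the \(n\times m\) convention explicit: the last axis of \(X\) must match the first axis of \(w\), and \(b\) must have one entry per output node.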

**Backward Pass** During training, we want to update the
relations between the input nodes and the output nodes. More
specifically, we want to know how the weight \(w\) and bias
\(b\) affect the loss value, i.e. we want to compute
\(\frac{\partial\ell}{\partial w}\) and
\(\frac{\partial\ell}{\partial b}\). In the backward pass, the
\(i\)th layer receives the gradient of the loss value with
respect to the input of the \((i+1)\)th layer. As the input of the
\((i+1)\)th layer is the output of the \(i\)th layer,
\(\frac{\partial\ell}{\partial \hat{y}^i}\) is already known; we denote it by \(\nabla^{(i)}\).

With these conditions, we can apply the laws we have to compute the gradients:

\(\frac{\partial\ell}{\partial w}=X^T\nabla^{(i)}\) (using Law 1)

\(\frac{\partial\ell}{\partial b}=\frac{\partial\ell}{\partial \hat{y}^i}\frac{\partial \hat{y}^i}{\partial b}=\nabla^{(i)}\)

With these results, we can update the weight and bias in the \(i\)th layer with \(w^{new}=w^{old}-\epsilon X^T\nabla^{(i)}\) and \(b^{new}=b^{old}-\epsilon \nabla^{(i)}\), where \(\epsilon\) is a preset hyperparameter called the learning rate. After updating these parameters, we need to pass the gradient of the loss with respect to the input of this layer to the previous layer. Formally, we have \(\frac{\partial\ell}{\partial x^{i}}=\nabla^{(i)} w^T\) (using Law 2).
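The three gradients and the update rule can be sketched in a few lines of NumPy. The helper names `linear_backward` and `sgd_step` are hypothetical, the demo values are illustrative, and the sketch assumes a single input row vector with \(w\) stored as an \(n\times m\) matrix:

```python
import numpy as np


def linear_backward(X, w, grad_out):
    """Return (dw, db, dx) for Y_hat = X w + b with one input row vector.

    X: shape (n,), w: shape (n, m), grad_out: dl/dY_hat with shape (m,).
    """
    dw = np.outer(X, grad_out)  # X^T grad_out, shape (n, m)
    db = grad_out.copy()        # dY_hat/db is the identity
    dx = grad_out @ w.T         # passed on to the previous layer, shape (n,)
    return dw, db, dx


def sgd_step(param, grad, lr):
    """Plain gradient-descent update: param_new = param_old - lr * grad."""
    return param - lr * grad


# Illustrative values only: n = 2 inputs, m = 3 outputs.
X_demo = np.array([1.0, 2.0])
w_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
grad_demo = np.array([1.0, 0.0, -1.0])

dw, db, dx = linear_backward(X_demo, w_demo, grad_demo)
w_new = sgd_step(w_demo, dw, lr=0.1)
print(dw.shape, db.shape, dx.shape)
```

Note that `dw` has the same shape as `w`, which is what allows the element-wise update in `sgd_step`.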

## Example

Assume we have a simple fully connected neural network with only two layers, where each layer has two inputs and two outputs. It can be visualized as below:

In the above example, we initialize the values as follows:

In the input layer, we assume the inputs are \(x_1=0.1\) and \(x_2=0.5\). Thus the input can be represented by a \(2\)-dimensional vector \(X=[0.1,0.5]\).

Since there are \(2\) input nodes, and \(2\) output nodes, the weight can be represented as a \(2\times 2\) matrix. We assume it to be \(w^{(0)}=\left[ {\begin{array}{*{20}c} w^{(0)}_{11} & w^{(0)}_{12} \\w^{(0)}_{21} & w^{(0)}_{22} \\ \end{array} } \right]=\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\0.25 & 0.30 \\ \end{array} } \right]\) where \(w^{(i)}_{kj}\) is the weight between \(x_k^{(i)}\) and \(y_j^{(i)}\). For example, \(w^{(0)}_{11}\) is the weight between \(x_1\) and \(y_1\) in our case.

Since there are \(2\) output nodes, the biases can be represented as a \(2\)d vector. In our example, we assume it to be \(b^{(0)}=[0.35, 0.45]\).

In order to compute the backward pass, we need to give a ground truth of the output nodes, so that we can compute the loss value and perform the backpropagation. Here we assume that the ground truth is \(y_1=0.01\) and \(y_2=0.99\). Thus the output can be represented by a \(2\)d vector \(Y=[0.01, 0.99]\).

*Forward Pass* In the forward pass, we want to compute \(\hat{Y}\).
In our case,
\(\hat{Y}=Xw^{(0)}+b^{(0)}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.49, 0.62]\)

Then with the predicted \(\hat{Y}\), we can compute the squared-error loss \(\ell=\frac{1}{2}\sum_j(\hat{y}_j-y_j)^2\) as \(\ell=\frac{1}{2}[(0.49-0.01)^2+(0.62-0.99)^2]=0.18365\). We also need to compute the gradient of the loss with respect to the output as \(\frac{\partial \ell}{\partial \hat{y}_1}=\hat{y}_1-y_1=0.48, \frac{\partial \ell}{\partial \hat{y}_2}=\hat{y}_2-y_2=-0.37\). Hence, we have \(\nabla^{(1)}=[0.48, -0.37]\).
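The forward pass and loss of this example can be reproduced with a short NumPy script. This is only a sketch using plain NumPy (not Tinynet itself); the variable names are chosen here for illustration:

```python
import numpy as np

# Values from the example above.
X = np.array([0.1, 0.5])
w = np.array([[0.15, 0.20],
              [0.25, 0.30]])
b = np.array([0.35, 0.45])
Y = np.array([0.01, 0.99])

Y_hat = X @ w + b                      # approximately [0.49, 0.62]
loss = 0.5 * np.sum((Y_hat - Y) ** 2)  # approximately 0.18365
grad_out = Y_hat - Y                   # approximately [0.48, -0.37]

print(Y_hat, loss, grad_out)
```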

*Backward Pass* Then in the backward pass, as shown above, we can
compute the gradient of the loss with respect to the weight, bias and
the input. In our case, we have

\(\frac{\partial \ell}{\partial w}=X^T \nabla^{(1)} =[0.1, 0.5]^T[0.48, -0.37]=\left[ {\begin{array}{*{20}c}0.048 & -0.037 \\0.24 & -0.185 \\ \end{array} } \right]\)

\(\frac{\partial \ell}{\partial b}=\nabla^{(1)}=[0.48, -0.37]\)

\(\frac{\partial \ell}{\partial x}=\nabla^{(1)} w^T=[0.48, -0.37]\left[ {\begin{array}{*{20}c} 0.15 & 0.25 \\0.20 & 0.30 \\ \end{array} } \right]=[-0.002, 0.009]\)
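As a quick check of these three results, the following sketch evaluates them with plain NumPy, starting from \(\nabla^{(1)}=[0.48, -0.37]\). The variable names are illustrative. The last line shows that the transpose in \(\nabla^{(1)} w^T\) matters: multiplying by \(w\) itself gives a different vector.

```python
import numpy as np

X = np.array([0.1, 0.5])
w = np.array([[0.15, 0.20],
              [0.25, 0.30]])
grad_out = np.array([0.48, -0.37])  # gradient of the loss w.r.t. the output

dw = np.outer(X, grad_out)  # X^T grad_out, a 2 x 2 matrix
db = grad_out               # the bias gradient equals grad_out
dx = grad_out @ w.T         # multiply by the transposed weight matrix

# For contrast only: omitting the transpose does not give dl/dx.
dx_without_transpose = grad_out @ w

print(dw, db, dx, dx_without_transpose)
```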

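*Numerical Verification* To verify that our derivatives are
derived correctly, we can compute them numerically with central differences. To
do so, we first let \(\epsilon=0.01\) and then compute
the derivative of the loss value with respect to each entry of \(w\).
For \(w_{11}^{(0)}=0.15\), we define \(w_{left}=w_{11}^{(0)}-\epsilon=0.14\) and \(w_{right}=w_{11}^{(0)}+\epsilon=0.16\). These give \(y_{left}=[0.489, 0.62]\) with \(\ell_{left}=\frac{1}{2}[(0.489-0.01)^2+(0.62-0.99)^2]=0.1831705\), and \(y_{right}=[0.491, 0.62]\) with \(\ell_{right}=\frac{1}{2}[(0.491-0.01)^2+(0.62-0.99)^2]=0.1841305\), so \(\frac{\partial \ell}{\partial w_{11}^{(0)}}=\frac{\ell_{right}-\ell_{left}}{2\epsilon}=\frac{0.1841305-0.1831705}{0.02}=0.048\).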

Similarly, we can follow the same procedure for \(w_{12}^{(0)}\), \(w_{21}^{(0)}\) and \(w_{22}^{(0)}\). After computation, we have \(\frac{\partial \ell}{\partial w_{12}^{(0)}}=\frac{0.1832805-0.1840205}{0.02}=-0.037\), \(\frac{\partial \ell}{\partial w_{21}^{(0)}}=\frac{0.1860625-0.1812625}{0.02}=0.24\) and \(\frac{\partial \ell}{\partial w_{22}^{(0)}}=\frac{0.1818125-0.1855125}{0.02}=-0.185\). With these four values we can form the gradient of the loss value with respect to the weight matrix as \(\frac{\partial \ell}{\partial w}=\left[ {\begin{array}{*{20}c} 0.048 & -0.037 \\0.24 & -0.185 \\ \end{array} } \right]\), and this result is consistent with the matrix we obtained from our derivation. So far we have verified that we are computing the derivatives correctly for the weight.

Similarly, we proceed to verify the biases as follows:

There are two biases in our case, \(b^{(0)}_{1}=0.35\) and \(b^{(0)}_2=0.45\). For \(b^{(0)}_{1}\), we define \(b_{left}=b^{(0)}_{1}-\epsilon=0.34\) and \(b_{right}=b^{(0)}_{1}+\epsilon=0.36\). Then we have

\(y_{left}=Xw+b^{(0)}_{left}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.34, 0.45]=[0.48, 0.62]\), and \(\ell_{left}=\frac{1}{2}[(0.48-0.01)^2+(0.62-0.99)^2]=0.1789\).

\(y_{right}=Xw+b^{(0)}_{right}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.36, 0.45]=[0.50, 0.62]\), and \(\ell_{right}=\frac{1}{2}[(0.50-0.01)^2+(0.62-0.99)^2]=0.1885\).

Then we can compute the derivative of the loss with respect to the bias \(b_{1}^{(0)}\) as \(\frac{\partial \ell}{\partial b_{1}^{(0)}}=\frac{\ell_{right}-\ell_{left}}{2\epsilon}=\frac{0.1885-0.1789}{0.02}=0.48\).

For \(b_{2}^{(0)}\), after computation, we have \(\frac{\partial \ell}{\partial b_{2}^{(0)}}=\frac{0.1800-0.1874}{0.02}=-0.37\). Hence we have \(\frac{\partial \ell}{\partial b}=[0.48, -0.37]\), which is consistent with the result we obtained before.

Having verified the weights and biases, we now proceed to verify the input vector as below:

There are two inputs in our example, \(x_1^{(0)}=0.1\) and \(x_2^{(0)}=0.5\). For \(x_1^{(0)}\), we define \(x_{left}=x_1^{(0)}-\epsilon=0.09\) and \(x_{right}=x_1^{(0)}+\epsilon=0.11\), and let \(X_{left}\) and \(X_{right}\) be the corresponding input vectors. Then we have:

\(y_{left}=X_{left}w+b^{(0)}=[0.09, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.4885, 0.618]\), and \(\ell_{left}=\frac{1}{2}[(0.4885-0.01)^2+(0.618-0.99)^2]=0.183673125\).

\(y_{right}=X_{right}w+b^{(0)}=[0.11, 0.5]\left[{\begin{array}{*{20}c}0.15 & 0.20 \\0.25 & 0.30 \\ \end{array}} \right]+[0.35, 0.45]=[0.4915, 0.622]\), and \(\ell_{right}=\frac{1}{2}[(0.4915-0.01)^2+(0.622-0.99)^2]=0.183633125\).

Then we can compute the derivative of the loss with respect to the input \(x_{1}^{(0)}\) as \(\frac{\partial \ell}{\partial x_{1}^{(0)}}=\frac{0.183633125-0.183673125}{0.02}=-0.002\).

Similarly, for \(x_2^{(0)}\), after computation, we have \(\frac{\partial \ell}{\partial x_{2}^{(0)}}=\frac{0.183747625-0.183567625}{0.02}=0.009\). Hence we will have \(\frac{\partial \ell}{\partial X}=[-0.002, 0.009]\), and the result is consistent with our results above.
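The finite-difference checks above can be automated. The sketch below assumes plain NumPy, and the helper names `loss_fn` and `numerical_grad` are hypothetical. It perturbs one entry at a time by \(\epsilon=0.01\) and applies the same central-difference formula to \(w\), \(b\) and \(X\):

```python
import numpy as np


def loss_fn(X, w, b, Y):
    """Squared-error loss of the example: 0.5 * sum((X w + b - Y)^2)."""
    Y_hat = X @ w + b
    return 0.5 * np.sum((Y_hat - Y) ** 2)


def numerical_grad(f, param, eps=0.01):
    """Central-difference estimate of df/dparam, one entry at a time.

    `f` takes no arguments and reads `param`, which is perturbed in place
    and restored afterwards.
    """
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_right = f()
        param[idx] = original - eps
        loss_left = f()
        param[idx] = original  # restore the unperturbed value
        grad[idx] = (loss_right - loss_left) / (2 * eps)
    return grad


X = np.array([0.1, 0.5])
w = np.array([[0.15, 0.20],
              [0.25, 0.30]])
b = np.array([0.35, 0.45])
Y = np.array([0.01, 0.99])


def f():
    return loss_fn(X, w, b, Y)


dw_num = numerical_grad(f, w)
db_num = numerical_grad(f, b)
dx_num = numerical_grad(f, X)
print(dw_num, db_num, dx_num)
```

Because this loss is quadratic in each parameter, the central difference matches the analytic derivative up to floating-point rounding. For general losses the two would only agree approximately.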

By performing the numerical verification as a cross-check, we can confirm that our derivations and computations are correct. Knowing the forward pass and backward pass, we now proceed to investigate how convolution and other operations work.

The implementation of the fully connected layer in Tinynet is as follows:

```python
from .base import Layer
from tinynet.core import Backend as np


class Linear(Layer):
    '''
    Linear layer performs fully connected operation.
    '''

    def __init__(self, name, input_dim, output_dim):
        super().__init__(name)
        # Xavier Initialization
        weight = np.random.randn(
            output_dim, input_dim) * np.sqrt(1/input_dim)
        bias = np.random.randn(output_dim) * np.sqrt(1/input_dim)
        self.type = 'Linear'
        self.weight = self.build_param(weight)
        self.bias = self.build_param(bias)

    def forward(self, input):
        '''
        The forward pass of fully connected layer is given by :math:`f(x)=wx+b`.
        '''
        # save input as the input will be used in backward pass
        self.input = input
        return np.matmul(input, self.weight.tensor.T) + self.bias.tensor

    def backward(self, in_gradient):
        '''
        In the backward pass, we compute the gradient with respect to :math:`w`, :math:`b`, and :math:`x`.

        We have:

        .. math::
            \\frac{\\partial l}{\\partial w} = \\frac{\\partial l}{\\partial y}\\frac{\\partial y}{\\partial w}=\\frac{\\partial l}{\\partial y} x
        '''
        # weight is stored as (output_dim, input_dim), hence the final transpose
        self.weight.gradient = np.matmul(self.input.T, in_gradient).T
        # sum the incoming gradient over the batch dimension
        self.bias.gradient = np.sum(in_gradient, axis=0)
        # gradient with respect to the input, passed to the previous layer
        return np.matmul(in_gradient, self.weight.tensor)
```
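The class above stores the weight with shape `(output_dim, input_dim)`, which is the transpose of the \(n\times m\) matrix \(w\) used in the formulas. The following standalone sketch uses plain NumPy instead of Tinynet's `Backend` (an assumption made so that it runs on its own) and random illustrative data to check that the two conventions give the same results:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_dim, output_dim = 4, 3, 2

X = rng.standard_normal((batch, input_dim))
weight = rng.standard_normal((output_dim, input_dim))  # layout used by the class
bias = rng.standard_normal(output_dim)
in_gradient = rng.standard_normal((batch, output_dim))

# Same operations as in forward() and backward() of the class.
out = X @ weight.T + bias
weight_grad = (X.T @ in_gradient).T  # shape (output_dim, input_dim)
bias_grad = in_gradient.sum(axis=0)  # shape (output_dim,)
input_grad = in_gradient @ weight    # shape (batch, input_dim)

# The n x m convention of the formulas: w = weight^T.
w = weight.T
same_forward = np.allclose(out, X @ w + bias)
same_weight_grad = np.allclose(weight_grad.T, X.T @ in_gradient)
same_input_grad = np.allclose(input_grad, in_gradient @ w.T)
print(same_forward, same_weight_grad, same_input_grad)
```

With a batch of inputs, the bias gradient is the sum of the incoming gradient over the batch axis, which reduces to \(\nabla^{(i)}\) for a single input.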