Fully Connected Layers

Definition The fully connected layer is the simplest and most fundamental neural network component: it connects every node in one layer to every node in the next layer. The figure below shows a simple fully connected network, in which yellow nodes represent the biases (nodes that do not depend on the output of the previous layer), green nodes represent the inputs to the neural network, blue nodes represent the intermediate activations and red nodes represent the outputs.

An Example of Fully Connected Layer

From the definition of the fully connected layer, we know that if the \(i\)th layer has \(n^{(i)}\) input nodes (in our figure, \(n^{(0)}=2\) (the green nodes) and \(n^{(1)}=2\) (the blue nodes)) and \(m^{(i)}\) output nodes (in our figure, \(m^{(0)}=2\) (the blue nodes) and \(m^{(1)}=2\) (the red nodes)), there will be \(n^{(i)}\times m^{(i)}\) relations between them. These relations can be represented as an \(n^{(i)}\times m^{(i)}\) matrix \(w^{(i)}\) (the weight). Besides these relations, there will be \(m^{(i)}\) biases, which can be represented as an \(m^{(i)}\)-dimensional vector \(b^{(i)}\) (the bias). In our figure, \(b^{(0)}\) has \(2\) entries (\(x_0\) to \(h_1^{(1)}\) and \(x_0\) to \(h_2^{(1)}\)) and \(b^{(1)}\) has \(2\) entries (\(h_0^{(1)}\) to \(\hat{y}_1\) and \(h_0^{(1)}\) to \(\hat{y}_2\)).

Forward Pass With the above notation, we can directly compute the forward pass. In the figure, for example, we can compute \(h_1^{(1)}\) as \(h_1^{(1)}=x_1w^{(0)}_{11}+x_2w^{(0)}_{21}+b^{(0)}_1\), where \(w^{(0)}_{jk}\) is the weight between \(x_j\) and \(h^{(1)}_{k}\) and \(b^{(0)}_1\) is the bias contributed by the bias node \(x_0\). In matrix form, we have \(\hat{Y}=Xw+b\), where \(X\) is the \(n\)-dimensional input (row) vector, \(w\) is the \(n\times m\) weight matrix and \(b\) is the \(m\)-dimensional bias vector.
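The matrix form can be sketched in a few lines of NumPy. This is only an illustration: the function name `linear_forward` and the numbers (with \(n=3\), \(m=2\)) are assumptions made here to show the shapes, not values from the figure.

```python
import numpy as np

def linear_forward(X, w, b):
    """Compute Y_hat = X w + b for a row-vector input.

    X: shape (n,)   input vector
    w: shape (n, m) weight matrix, w[j, k] links input j to output k
    b: shape (m,)   bias vector
    """
    return X @ w + b

# Hypothetical values, chosen only to make the shapes visible (n = 3, m = 2).
X = np.array([1.0, 2.0, 3.0])
w = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5])

Y_hat = linear_forward(X, w, b)
print(Y_hat)  # [4.5 4.5]
```

Each output entry is the dot product of the input with one column of \(w\), plus the matching bias entry.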

Backward Pass In the training process, we want to update the relations between the input nodes and the output nodes. More specifically, we want to know how the weight \(w\) and the bias \(b\) affect the loss value, i.e. we want to compute \(\frac{\partial\ell}{\partial w}\) and \(\frac{\partial\ell}{\partial b}\). In the backward pass, the \(i\)th layer receives the gradient of the loss value with respect to the input of the \((i+1)\)th layer. As the input of the \((i+1)\)th layer is the output of the \(i\)th layer, \(\frac{\partial\ell}{\partial \hat{y}^{i}}\) is already known; we denote it by \(\nabla^{(i)}\).

With these conditions, we can apply the laws we already have to compute the gradients:

  • \(\frac{\partial\ell}{\partial w}=X^T\nabla^{(i)}\) (using Law 1)

  • \(\frac{\partial\ell}{\partial b}=\frac{\partial\ell}{\partial \hat{y}^i}\frac{\partial \hat{y}^i}{\partial b}=\nabla^{(i)}\)

With these results, we can update the weight and bias in the \(i\)th layer with \(w^{new}=w^{old}-\epsilon X^T\nabla^{(i)}\) and \(b^{new}=b^{old}-\epsilon \nabla^{(i)}\), where \(\epsilon\) is a preset hyperparameter called the learning rate. Besides updating these parameters, we need to pass the gradient of the loss with respect to the input of this layer to the previous layer. Formally, we have \(\frac{\partial\ell}{\partial x^{i}}=\nabla^{(i)} w^T\) (using Law 2).
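The three gradients and the update rule can be written as a small NumPy sketch. The helper names `linear_backward` and `sgd_step`, the learning rate and all numbers below are hypothetical and serve only to show how the shapes line up; inputs are kept as \(1\times n\) arrays so that the transpose is meaningful.

```python
import numpy as np

def linear_backward(X, w, grad_out):
    """Gradients of the loss for Y_hat = X w + b.

    X:        shape (1, n), the input saved during the forward pass
    w:        shape (n, m)
    grad_out: shape (1, m), gradient of the loss w.r.t. Y_hat
    """
    grad_w = X.T @ grad_out        # (n, m), same shape as w
    grad_b = grad_out.sum(axis=0)  # (m,)
    grad_x = grad_out @ w.T        # (1, n), handed to the previous layer
    return grad_w, grad_b, grad_x

def sgd_step(param, grad, lr):
    """Plain gradient-descent update: param - lr * grad."""
    return param - lr * grad

# Hypothetical numbers, for illustration only.
X = np.array([[1.0, 2.0]])
w = np.array([[0.5, -1.0],
              [2.0, 0.0]])
b = np.array([0.0, 0.0])
grad_out = np.array([[1.0, -2.0]])

grad_w, grad_b, grad_x = linear_backward(X, w, grad_out)
w_new = sgd_step(w, grad_w, lr=0.1)
b_new = sgd_step(b, grad_b, lr=0.1)
```

Note that `grad_x` is computed from the weights used in the forward pass, before `sgd_step` replaces them.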

Example

Assume we have a simple fully connected neural network with only two layers, where each layer has two inputs and two outputs. It can be visualized as below:

Another Example of Fully Connected Layer

In the above example, we initialize the values as the following:

  • In the input layer, we assume the inputs are \(x_1=0.1\) and \(x_2=0.5\). Thus the input can be represented by a \(2\)d vector \(X=[0.1,0.5]\)

  • Since there are \(2\) input nodes, and \(2\) output nodes, the weight can be represented as a \(2\times 2\) matrix. We assume it to be \(w^{(0)}=\left[ {\begin{array}{*{20}c} w^{(0)}_{11} & w^{(0)}_{12} \\w^{(0)}_{21} & w^{(0)}_{22} \\ \end{array} } \right]=\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\0.25 & 0.30 \\ \end{array} } \right]\) where \(w^{(i)}_{kj}\) is the weight between \(x_k^{(i)}\) and \(y_j^{(i)}\). For example, \(w^{(0)}_{11}\) is the weight between \(x_1\) and \(y_1\) in our case.

  • Since there are \(2\) output nodes, the biases can be represented as a \(2\)d vector. In our example, we assume it to be \(b^{(0)}=[0.35, 0.45]\).

  • In order to compute the backward pass, we need to give a ground truth of the output nodes, so that we can compute the loss value and perform the backpropagation. Here we assume that the ground truth is \(y_1=0.01\) and \(y_2=0.99\). Thus the output can be represented by a \(2\)d vector \(Y=[0.01, 0.99]\).

Forward Pass In the forward pass, we want to compute \(\hat{Y}\). In our case, \(\hat{Y}=Xw^{(0)}+b^{(0)}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.49, 0.62]\)

Then, with the predicted \(\hat{Y}\), we can compute the loss value as \(\ell=\frac{1}{2}[(0.49-0.01)^2+(0.62-0.99)^2]=0.18365\). We also need to compute the gradient of the loss with respect to the output: \(\frac{\partial \ell}{\partial \hat{y}_1}=\hat{y}_1-y_1=0.48, \frac{\partial \ell}{\partial \hat{y}_2}=\hat{y}_2-y_2=-0.37\). Hence, we have \(\nabla^{(1)}=[0.48, -0.37]\).
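These numbers can be reproduced with NumPy. The variable names below are chosen here for illustration; the values are the ones from the example.

```python
import numpy as np

X = np.array([0.1, 0.5])            # input
w0 = np.array([[0.15, 0.20],
               [0.25, 0.30]])       # weight matrix w^(0)
b0 = np.array([0.35, 0.45])         # bias vector b^(0)
Y = np.array([0.01, 0.99])          # ground truth

Y_hat = X @ w0 + b0                     # forward pass
loss = 0.5 * np.sum((Y_hat - Y) ** 2)   # squared-error loss
grad_out = Y_hat - Y                    # gradient of the loss w.r.t. Y_hat

print(Y_hat, loss, grad_out)
```

Up to floating-point rounding, the printed values should agree with \([0.49, 0.62]\), \(0.18365\) and \([0.48, -0.37]\).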

Backward Pass Then in the backward pass, as shown above, we can compute the gradient of the loss with respect to the weight, bias and the input. In our case, we have

  • \(\frac{\partial \ell}{\partial w}=X^T \nabla^{(1)} =[0.1, 0.5]^T[0.48, -0.37]=\left[ {\begin{array}{*{20}c}0.048 & -0.037 \\0.24 & -0.185 \\ \end{array} } \right]\)

  • \(\frac{\partial \ell}{\partial b}=\nabla^{(1)}=[0.48, -0.37]\)

  • \(\frac{\partial \ell}{\partial x}=\nabla^{(1)} w^T=[0.48, -0.37]\left[ {\begin{array}{*{20}c} 0.15 & 0.25 \\0.20 & 0.30 \\ \end{array} } \right]=[-0.002, 0.009]\)
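The three gradients can be checked with a short NumPy sketch. The names are illustrative; the input and the incoming gradient are stored as \(1\times 2\) arrays so that `X.T @ grad_out` forms the outer product.

```python
import numpy as np

X = np.array([[0.1, 0.5]])            # input, shape (1, 2)
w0 = np.array([[0.15, 0.20],
               [0.25, 0.30]])         # weight matrix w^(0)
grad_out = np.array([[0.48, -0.37]])  # gradient w.r.t. the output

grad_w = X.T @ grad_out          # gradient w.r.t. the weight, shape (2, 2)
grad_b = grad_out.sum(axis=0)    # gradient w.r.t. the bias, shape (2,)
grad_x = grad_out @ w0.T         # gradient w.r.t. the input, shape (1, 2)

print(grad_w)
print(grad_b)
print(grad_x)
```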

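Numerical Verification To verify that our derivatives are derived correctly, we can also compute them numerically with central differences. We first let \(\epsilon=0.01\) and then compute the derivative of the loss value with respect to each entry of \(w\) as follows:

  • For \(w_{11}^{(0)}=0.15\), we define \(w_{left}=w_{11}^{(0)}-\epsilon=0.14\) and \(w_{right}=w_{11}^{(0)}+\epsilon=0.16\). Then we have

    • \(y_{left}=Xw_{left}+b^{(0)}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.14 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.489, 0.62]\), and \(\ell_{left}=\frac{1}{2}[(0.489-0.01)^2+(0.62-0.99)^2]=0.1831705\).

    • \(y_{right}=Xw_{right}+b^{(0)}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.16 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.491, 0.62]\), and \(\ell_{right}=\frac{1}{2}[(0.491-0.01)^2+(0.62-0.99)^2]=0.1841305\).

    • Then we can compute the derivative of the loss with respect to \(w_{11}^{(0)}\) as \(\frac{\partial \ell}{\partial w_{11}^{(0)}}=\frac{\ell_{right}-\ell_{left}}{2\epsilon}=\frac{0.1841305-0.1831705}{0.02}=0.048\).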

Similarly, we can repeat the same procedure for \(w_{12}^{(0)}\), \(w_{21}^{(0)}\) and \(w_{22}^{(0)}\). After computation, we have \(\frac{\partial \ell}{\partial w_{12}^{(0)}}=\frac{0.1832805-0.1840205}{0.02}=-0.037\), \(\frac{\partial \ell}{\partial w_{21}^{(0)}}=\frac{0.1860625-0.1812625}{0.02}=0.24\) and \(\frac{\partial \ell}{\partial w_{22}^{(0)}}=\frac{0.1818125-0.1855125}{0.02}=-0.185\). With these four values we can form the gradient of the loss value with respect to the weight matrix as \(\frac{\partial \ell}{\partial w}=\left[ {\begin{array}{*{20}c} 0.048 & -0.037 \\0.24 & -0.185 \\ \end{array} } \right]\), and this result is consistent with the matrix we get from our derivation. So far we have verified that we are computing the derivatives correctly for the weights.
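The per-entry procedure for the weights can be sketched as a loop. The helper name `loss_fn` and the variable names are assumptions made for this illustration; the values are those of the example.

```python
import numpy as np

X = np.array([0.1, 0.5])
w0 = np.array([[0.15, 0.20],
               [0.25, 0.30]])
b0 = np.array([0.35, 0.45])
Y = np.array([0.01, 0.99])
eps = 0.01

def loss_fn(w):
    """Squared-error loss of the example as a function of the weight matrix."""
    Y_hat = X @ w + b0
    return 0.5 * np.sum((Y_hat - Y) ** 2)

num_grad_w = np.zeros_like(w0)
for j in range(w0.shape[0]):
    for k in range(w0.shape[1]):
        w_left, w_right = w0.copy(), w0.copy()
        w_left[j, k] -= eps     # perturb a single entry downwards
        w_right[j, k] += eps    # and upwards
        l_left, l_right = loss_fn(w_left), loss_fn(w_right)
        # central difference: (l_right - l_left) / (2 * eps)
        num_grad_w[j, k] = (l_right - l_left) / (2 * eps)
        print(f"w_{j + 1}{k + 1}: l_left={l_left:.7f}, l_right={l_right:.7f}")

print(num_grad_w)
```

Because the loss is quadratic in each weight, the central difference is exact here up to floating-point rounding, which is why the numerical and analytical matrices agree so closely.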

Similarly, we proceed to verify the biases as follows:

  • There are two biases in our case, \(b^{(0)}_{1}=0.35\) and \(b^{(0)}_2=0.45\). For \(b^{(0)}_{1}\), we define \(b_{left}=b^{(0)}_{1}-\epsilon=0.34\) and \(b_{right}=b^{(0)}_{1}+\epsilon=0.36\). Then we have

    • \(y_{left}=Xw+b^{(0)}_{left}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.34, 0.45]=[0.48, 0.62]\), and \(\ell_{left}=\frac{1}{2}[(0.48-0.01)^2+(0.62-0.99)^2]=0.1789\).

    • \(y_{right}=Xw+b^{(0)}_{right}=[0.1, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.36, 0.45]=[0.50, 0.62]\), and \(\ell_{right}=\frac{1}{2}[(0.50-0.01)^2+(0.62-0.99)^2]=0.1885\).

    • Then we can compute the derivative of the loss with respect to the bias \(b_{1}^{(0)}\) as \(\frac{\partial \ell}{\partial b_{1}^{(0)}}=\frac{\ell_{right}-\ell_{left}}{2\epsilon}=\frac{0.1885-0.1789}{0.02}=0.48\).

  • For \(b_{2}^{(0)}\), after computation, we have \(\frac{\partial \ell}{\partial b_{2}^{(0)}}=\frac{0.18-0.1874}{0.02}=-0.37\). Hence we have \(\frac{\partial \ell}{\partial b}=[0.48, -0.37]\), which is consistent with the result we obtained before.

Having verified the weights and biases, we now proceed to verify the gradient with respect to the input as below:

  • There are two inputs in our example, \(x_1^{(0)}=0.1\) and \(x_2^{(0)}=0.5\). For \(x_1^{(0)}\), we define \(x_{left}=x_1^{(0)}-\epsilon=0.09\) and \(x_{right}=x_1^{(0)}+\epsilon=0.11\), and let \(X_{left}\) and \(X_{right}\) be the corresponding input vectors. Then we have:

    • \(y_{left}=X_{left}w+b^{(0)}=[0.09, 0.5]\left[ {\begin{array}{*{20}c} 0.15 & 0.20 \\ 0.25 & 0.30 \\ \end{array} } \right]+[0.35, 0.45]=[0.4885, 0.618]\), and \(\ell_{left}=\frac{1}{2}[(0.4885-0.01)^2+(0.618-0.99)^2]=0.183673125\).

    • \(y_{right}=X_{right}w+b^{(0)}=[0.11, 0.5]\left[{\begin{array}{*{20}c}0.15 & 0.20 \\0.25 & 0.30 \\ \end{array}} \right]+[0.35, 0.45]=[0.4915, 0.622]\), and \(\ell_{right}=\frac{1}{2}[(0.4915-0.01)^2+(0.622-0.99)^2]=0.183633125\).

    • Then we can compute the derivative of the loss with respect to the input \(x_{1}^{(0)}\) as \(\frac{\partial \ell}{\partial x_{1}^{(0)}}=\frac{\ell_{right}-\ell_{left}}{2\epsilon}=\frac{0.183633125-0.183673125}{0.02}=-0.002\).

  • Similarly, for \(x_2^{(0)}\), after computation, we have \(\frac{\partial \ell}{\partial x_{2}^{(0)}}=\frac{0.183747625-0.183567625}{0.02}=0.009\). Hence we will have \(\frac{\partial \ell}{\partial X}=[-0.002, 0.009]\), and the result is consistent with our results above.
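The bias and input checks follow the same pattern, so they can share one helper. The sketch below is an illustration under stated assumptions: `numerical_grad` and `loss` are hypothetical names introduced here, and the helper simply applies the central-difference formula to one entry at a time.

```python
import numpy as np

def numerical_grad(f, theta, eps=0.01):
    """Central-difference estimate of the gradient of f at theta."""
    grad = np.zeros(theta.shape)
    for idx in np.ndindex(*theta.shape):
        left, right = theta.copy(), theta.copy()
        left[idx] -= eps     # theta - eps in a single entry
        right[idx] += eps    # theta + eps in the same entry
        grad[idx] = (f(right) - f(left)) / (2 * eps)
    return grad

X = np.array([0.1, 0.5])
w0 = np.array([[0.15, 0.20],
               [0.25, 0.30]])
b0 = np.array([0.35, 0.45])
Y = np.array([0.01, 0.99])

def loss(X_, w_, b_):
    """Squared-error loss of the example."""
    return 0.5 * np.sum((X_ @ w_ + b_ - Y) ** 2)

# Analytical gradients from the backward pass.
grad_out = X @ w0 + b0 - Y
ana_b = grad_out
ana_x = grad_out @ w0.T

# Numerical gradients, varying one argument and holding the others fixed.
num_b = numerical_grad(lambda b_: loss(X, w0, b_), b0)
num_x = numerical_grad(lambda x_: loss(x_, w0, b0), X)

print(np.max(np.abs(num_b - ana_b)), np.max(np.abs(num_x - ana_x)))
```

Both printed differences should be at the level of floating-point rounding.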

By using the numerical verification as a cross-check, we can confirm that our derivations and computations are correct. Knowing the forward pass and backward pass, we now proceed to investigate how convolution and other operations work.

The implementation of the fully connected layer in Tinynet is shown below:

from .base import Layer
from tinynet.core import Backend as np

class Linear(Layer):
    '''
    Linear layer performs fully connected operation.
    '''
    def __init__(self, name, input_dim, output_dim):
        super().__init__(name)
        # Xavier-style initialization: scale by sqrt(1 / input_dim);
        # the weight is stored with shape (output_dim, input_dim)
        weight = np.random.randn(
            output_dim, input_dim) * np.sqrt(1/input_dim)
        bias = np.random.randn(output_dim) * np.sqrt(1/input_dim)
        self.type = 'Linear'
        self.weight = self.build_param(weight)
        self.bias = self.build_param(bias)
    
    def forward(self, input):
        '''
        The forward pass of the fully connected layer is given by :math:`f(x)=xw^T+b`.
        '''
        # save input as the input will be used in backward pass
        self.input = input
        return np.matmul(input, self.weight.tensor.T) + self.bias.tensor
    
    def backward(self, in_gradient):
        '''
        In the backward pass, we compute the gradient with respect to :math:`w`, :math:`b`, and :math:`x`.

        We have:

        .. math::

            \\frac{\\partial l}{\\partial w} = \\frac{\\partial l}{\\partial y}\\frac{\\partial y}{\\partial w}=\\frac{\\partial l}{\\partial y} x
        '''
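        # the weight is stored as (output_dim, input_dim), so x^T * gradient is transposed to match
        self.weight.gradient = np.matmul(self.input.T, in_gradient).T
        # sum the incoming gradient over the batch dimension
        self.bias.gradient = np.sum(in_gradient, axis=0)
        # gradient with respect to the input, passed on to the previous layer
        return np.matmul(in_gradient, self.weight.tensor)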
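To connect this listing with the worked example, here is a standalone NumPy sketch. `MiniLinear` is a hypothetical stand-in written for this illustration; it does not use Tinynet's `Layer` base class or `build_param`, and only mirrors the shape conventions visible in the listing (weight stored as `(output_dim, input_dim)`, batched input of shape `(batch, input_dim)`).

```python
import numpy as np

class MiniLinear:
    """Hypothetical stand-in for the layer above, using plain NumPy arrays."""

    def __init__(self, weight, bias):
        self.weight = weight    # shape (output_dim, input_dim)
        self.bias = bias        # shape (output_dim,)

    def forward(self, x):
        self.input = x          # saved for the backward pass, shape (batch, input_dim)
        return x @ self.weight.T + self.bias

    def backward(self, in_gradient):
        # same value as (input^T @ in_gradient)^T in the listing
        self.weight_grad = in_gradient.T @ self.input
        self.bias_grad = in_gradient.sum(axis=0)
        return in_gradient @ self.weight

# The example's w^(0) is (input, output), so it is transposed for this storage layout.
w0 = np.array([[0.15, 0.20],
               [0.25, 0.30]])
layer = MiniLinear(weight=w0.T.copy(), bias=np.array([0.35, 0.45]))

x = np.array([[0.1, 0.5]])      # a batch with one sample
y = np.array([[0.01, 0.99]])

y_hat = layer.forward(x)
grad_x = layer.backward(y_hat - y)

print(y_hat)
print(layer.weight_grad.T)      # transposed back to the (input, output) layout of the text
print(grad_x)
```

With this layout the stored weight gradient is the transpose of the matrix \(X^T\nabla^{(1)}\) from the worked example, which is why it is transposed before printing.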