Convolutional Layer

Definition In fully connected layers, the input is always a one-dimensional vector. However, images are usually stored as a multi-dimensional matrix and have implicit spatial structures. For example, the eyes are always on top of the nose, etc. These properties are not well expressed using a fully connected layer. Hence we use the convolutional layers to preserve these properties. For example, assume we have a single channel input matrix \(A_{3\times 3}\) and single filter matrix \(B_{2\times 2}\), and

\[\begin{split}A=\left[ {\begin{array}{*{20}c} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} } \right], B=\left[ {\begin{array}{*{20}c} b_{11} & b_{12} \\ b_{21} & b_{22} \end{array} } \right]\end{split}\]

Then, we slide the filter \(B\) with a unit stride, i.e. move one column or one row at a time. If we use \(a_{ij}\) and \(b_{ij}\) to denote the element in \(A\) and \(B\) at the \((i,j)\) location. Then we can obtain the output of the convolutional layer with the following steps:

  • At the beginning, the \(2\times 2\) filter is placed at the upper left corner. At this time, we perform the dot product and we will have \(y_{11}=a_{11}b_{11}+a_{12}b_{12}+a_{21}b_{21}+a_{22}b_{22}\).

  • Then we slide the filter across the width for a unit stride, i.e. we move the slide to the upper right corner. At this time, we perform the dot product and we will have \(y_{12}=a_{12}b_{11}+a_{13}b_{12}+a_{22}b_{21}+a_{23}b_{22}\).

  • Then we found that there is no more values on the right side, so we start to slide the filter on the next row. At this time, we start at the bottom left corner and we can obtain that \(y_{21}=a_{21}b_{11}+a_{22}b_{12}+a_{31}b_{21}+a_{32}b_{22}\).

  • Then we again slide the filter to the right side, i.e. the bottom right corner and we obtain that \(y_{22}=a_{22}b_{11}+a_{23}b_{12}+a_{32}b_{21}+a_{33}b_{22}\).

After these steps, we found that we get four outputs, and we can obtain the final output if we place the output values to corresponding locations, i.e. the value we computed at the upper left corner is placed at the upper left corner and so on so forth. Hence, in this example, we get a \((1,2,2)\) output matrix

\[\begin{split}C=\left[ {\begin{array}{*{20}c} y_{11} & y_{12} \\ y_{21} & y_{22} \end{array} } \right]\end{split}\]

More generally, we can formalize the input, the filters and the output of a convolutional layer as below:

  • The input is a \((N, C, H, W)\) tensor, where \(N\) is the number of the input matrix in a single batch, \(C\) is the channel of the matrix, \(H\) and \(W\) are the height and width of the input matrix. For example, \(10\) coloured images with the size \((224, 224)\) can be represented as a \((10, 3, 224, 224)\) tensor.

  • The filters is a \((K, C, H_f, W_f)\) tensor, where \(K\) is the number of filters, \(C\) is the channel of the filters and it will always be identical to the channel of the input matrix. \(H_f\) and \(W_f\) are the height and width of the filters.

  • The output is a \((N, K, H_{out}, W_{out})\) tensor.

  • The stride that we use to slide the filters are denoted as \(S\).

With these notations, we can compute the output of a convolution layer with seven loops.

Though the convolution operation can be computed by the above algorithm, we can still use matrix multiplication to perform such computation as suggested by A guide to convolution arithmetic for deep learning. The benefits of using matrix multiplication are two-fold:

  • We have already gotten two laws for computing the differentiation of linear transformation. If we can define the convolution operation as \(g(x)=AX+B\) (i.e. matrix multiplication), we could easily reuse the two laws and get the derivative of the loss value with respect to the filter and input.

  • We are about to study how to compute the forward pass of Deconv operation and that operation can be easily defined with the matrix multiplication form of the convolution operation. We will see this in Section 3.2.1 Deconv.

Hence, in the below computation of the forward pass and backward pass of the convolution operation, we will show how to convert it into a matrix multiplication.

Forward Pass Recall that the input \(X\), the filter \(W\) and the expected output \(Y\) we have in the above example.

\[\begin{split}X=\left[ {\begin{array}{*{20}c} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} } \right], W=\left[ {\begin{array}{*{20}c} b_{11} & b_{12} \\ b_{21} & b_{22} \end{array} } \right], Y=\left[ {\begin{array}{*{20}c} y_{11} & y_{12} \\ y_{21} & y_{22} \end{array} } \right]\end{split}\]

If we unroll the input and the output into vectors from left to right, top to bottom, we can also represent the filters as a sparse matrix where the non-zero elements are the elements in the filters. For example, We can unroll the input and output in our case as \(X^*_{9\times 1}=[a_{11},a_{12},a_{13},a_{21},a_{22},a_{23},a_{31},a_{32},a_{33}]^T\) and \(Y^*_{4\times 1}=[y_{11},y_{12},y_{21} ,y_{22}]^T\). Then we will want a \(4\times 9\) matrix \(W^*\) such that \(Y^* = W^*X^*\). From the direct computation of convolutional layers, we can transform the original filters \(W\) into

\[\begin{split}W^*=\left[ {\begin{array}{*{20}c} b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 & 0 \\ 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 \\ 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 \\ 0 & 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} \end{array} } \right]\end{split}\]

Then we can verify that

\[\begin{split}W^*X^*=\left[ {\begin{array}{*{20}c} b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 & 0 \\ 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 \\ 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 \\ 0 & 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} \end{array} } \right] \times [a_{11},a_{12},a_{13}\cdots, a_{33}]^T = Y^*\end{split}\]

Backward Pass From the forward pass, we converted the filters into a sparse matrix \(W^*\), and we found that the convolution is also a linear operation, i.e. \(Y^*=W^*X^*\). Similar to the backward pass of fully connected layers, we can directly compute the gradient of the loss value with respect to the weight matrix as \(\frac{\partial\ell}{\partial W^*}=\nabla^{(i)} (X^*)^T\) (Using the Law 2). Hence, we can update the weight matrix in convolutional layers as \((W^*)^{new}=(W^*)^{old}-\epsilon \nabla^{(i)}(X^*)^T\) and it is the same as in fully connected layers.

Besides, we will need to compute the gradient that we want to pass to previous layers, i.e. the gradient of the loss value with respect to the input matrix. We will have \(\frac{\partial\ell}{\partial X^*}=\frac{\partial\ell}{\partial Y^*}\frac{\partial Y^*}{\partial X^*}=(W^*)^T\nabla^{(i)}\) (Using the Law 1).

Example

Here we will show how the unrolling process works for convolution operation and how to perform the forward pass in the direct and matrix multiplication methods. Since the backward pass is identical to fully connected layers, we will not compute the backward pass in this example.

We assume that we have an input \(X\), the filter \(W\) as

\[\begin{split}X=\left[ {\begin{array}{*{20}c} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{array} } \right], W=\left[ {\begin{array}{*{20}c} 1 & 2 \\ 3 & 4 \end{array} } \right]\end{split}\]

Direct Computation To compute the output directly, we will have to slide the filter \(W\) from left to right, from top to bottom. We will have \(1*1+2*2+4* 3+5*4=37\), \(2*1+3*2+5*3+6*4=47\), \(4*1+5*2+7*3+8*4=67\), \(5*1+6*2+8*3+9*4=77\) in the upper left, upper right, bottom left and bottom right corners. Then, we can get the output as

\[\begin{split}Y=\left[ {\begin{array}{*{20}c} 37 & 47 \\ 67 & 77 \end{array} } \right]\end{split}\]

Matrix Multiplcation With matrix multiplication approach, we need to unroll the filter and input matrices into

\[\begin{split}W^*=\left[ {\begin{array}{*{20}c} 1 & 2 & 0 & 3 & 4 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 0 & 3 & 4 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 2 & 0 & 3 & 4 & 0 \\ 0 & 0 & 0 & 0 & 1 & 2 & 0 & 3 & 4 \end{array} } \right], X^*=\left[ {\begin{array}{*{20}c} 1 & 2 & \cdots & 9 \end{array} } \right]^T\end{split}\]

Then we can compute the output directly by \(Y^*=W^*X^*=[37,47,67,77]^T\). By reshaping \(Y^*\), we can easily obtain the desired output matrix \(Y_{2\times 2}\).

Since the backward process will be identical to what we did in Section 3.1.1 Fully Connected if we perform the forward pass in a matrix multiplication way, we will omit the examples here.

The implementation of convolutional layer in Tinynet is as below:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
from .base import Layer
from tinynet.core import Backend as np
from tinynet.core import GPU

'''
These im2col and col2im should be credited to:
https://github.com/huyouare/CS231n. Some minor modifications are made to ensure
it works on GPU.
'''
def get_im2col_indices(x_shape, field_height=3, field_width=3, padding=1, stride=1):
    # First figure out what the size of the output should be
    N, C, H, W = x_shape
    # check if the output shape are integers
    assert (H + 2 * padding - field_height) % stride == 0
    assert (W + 2 * padding - field_height) % stride == 0
    out_height = (H + 2 * padding - field_height) / stride + 1
    out_width = (W + 2 * padding - field_width) / stride + 1
    # np.arange generates an evenly paced array, here it will be [0,1,2,..., field_height]
    # then np.repeat will repeat each elements in the array. here i0 will be 
    # [0,0,0...,0, 1,...,1,...., field_height,...,field_height]. Each element will be repeated field_width times.
    i0 = np.repeat(np.arange(field_height, dtype='int32'), field_width)
    # np.tile will repeat the whole array for C times.
    # i0 will become [0,0,...,0,1....,field_height,0,...,field_height]
    i0 = np.tile(i0, C)
    # Similarly, i1 will be [0,...0, 1*stride,...,1*stride,..., out_width * stride]
    i1 = stride * np.repeat(np.arange(int(out_height),
                                      dtype='int32'), int(out_width))

    j0 = np.tile(np.arange(field_width), field_height * C)
    j1 = stride * np.tile(np.arange(out_width, dtype='int32'), int(out_height))
    i = i0.reshape(-1, 1) + i1.reshape(1, -1)
    j = j0.reshape(-1, 1) + j1.reshape(1, -1)

    k = np.repeat(np.arange(C, dtype='int32'),
                  field_height * field_width).reshape(-1, 1)
    # shape of k: (C * field_height * field_width, 1)
    # shape of i: (C * field_height * field_width, out_height * out_width)
    # shape of j: (C * field_height * field_width, out_height * out_width)
    # k for indicating channels
    return (k, i, j)


def im2col_indices(x, field_height=3, field_width=3, padding=1, stride=1):
    """ An implementation of im2col based on some fancy indexing """
    # Zero-pad the input
    p = padding
    # add paddings to neighbors
    x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')

    k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding,
                                 stride)
    # get the columns with fancy indexing
    cols = x_padded[:, k, i, j]
    C = x.shape[1]
    cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1)
    return cols


def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1,
                   stride=1):
    '''
    An implementation of col2im based on fancy indexing and np.add.at
    '''
    N, C, H, W = x_shape
    H_padded, W_padded = H + 2 * padding, W + 2 * padding
    x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype)
    k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding,
                                 stride)
    cols_reshaped = cols.reshape(C * field_height * field_width, -1, N)
    cols_reshaped = cols_reshaped.transpose(2, 0, 1)
    if GPU:
        # In cupy, scatter_add is equivalent to np.add.at
        np.scatter_add(x_padded, (slice(None), k, i, j), cols_reshaped)
    else:
        # ufunc.at performed unbuffered inplace operation.
        # (a, indices, b)
        # For addition ufunc, this method is equivalent to a[indices] += b,
        # except that results are accumulated for elements that are indexed 
        # more than once
        np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped)
    if padding == 0:
        return x_padded
    return x_padded[:, :, padding:-padding, padding:-padding]

class Conv2D(Layer):
    '''
    Conv2D performs convolutional operation with given input.
    '''

    def __init__(self, name, input_dim, n_filter, h_filter, w_filter, stride, padding):
        '''
        :param input_dim: the input data dimension, it should be of the shape (N, C, H, W) where N is for number of input data, C is for the channel, H for height and W for width.

        :param n_filter: the number of filters used in this layer. It can be any integers.

        :param h_filter: the height of the filter.

        :param w_filter: the width of the filter.

        :param stride: the stride of the sliding filter.
        '''
        super().__init__(name)
        self.type = 'Conv2D'
        self.input_channel, self.input_height, self.input_width = input_dim
        self.n_filter = n_filter
        self.h_filter = h_filter
        self.w_filter = w_filter
        self.stride = stride
        self.padding = padding
        weight = np.random.randn(
            self.n_filter, self.input_channel, self.h_filter, self.w_filter) * np.sqrt(1.0/(self.input_channel* self.h_filter * self.w_filter))
        bias = np.random.randn(self.n_filter) * np.sqrt(1.0/(self.input_channel* self.h_filter * self.w_filter))
        self.weight = self.build_param(weight)
        self.bias = self.build_param(bias)
        self.out_height = (self.input_height - self.h_filter +
                           2 * padding) / self.stride + 1
        self.out_width = (self.input_width - self.w_filter +
                          2 * padding) / self.stride + 1
        if not self.out_width.is_integer() or not self.out_height.is_integer():
            raise Exception("[Tinynet] Invalid dimension settings!")
        self.out_height, self.out_width = int(
            self.out_height), int(self.out_width)
        self.out_dim = (self.n_filter, self.out_height, self.out_width)

    def forward(self, input):
        self.n_input = input.shape[0]
        self.input_col = im2col_indices(
            input, self.h_filter, self.w_filter, stride=self.stride, padding=self.padding)
        weight_in_row = self.weight.tensor.reshape(self.n_filter, -1)
        output = np.matmul(weight_in_row, self.input_col) + self.bias.tensor.reshape(self.n_filter, 1)
        output = output.reshape(
            self.n_filter, self.out_height, self.out_width, self.n_input)
        output = output.transpose(3, 0, 1, 2)
        return output

    def backward(self, in_gradient):
        gradient_flat = in_gradient.transpose(
            1, 2, 3, 0).reshape(self.n_filter, -1)
        weight_gradient = np.matmul(gradient_flat, self.input_col.T)
        self.weight.gradient = weight_gradient.reshape(
            self.weight.tensor.shape)
        self.bias.gradient = np.sum(in_gradient, axis=(
            0, 2, 3)).reshape(self.n_filter)
        weight_flat = self.weight.tensor.reshape(self.n_filter, -1)
        out_gradient_col = np.matmul(weight_flat.T, gradient_flat)
        shape = (self.n_input, self.input_channel,
                 self.input_height, self.input_width)
        out_gradient = col2im_indices(out_gradient_col, shape, self.h_filter,
                                      self.w_filter, self.padding, self.stride)
        return out_gradient