Linear Transformation of Matrices
Differentiation of Linear Transformation of Matrices

As suggested above, we will need to compute the derivatives of the loss functions with respect to the parameters in neural networks. Since we usually represent the data in neural networks (such as the inputs, outputs, and other parameters) as matrices, and the most fundamental transformations in neural networks are linear, it is essential to understand how to compute the derivatives of a linear transformation of matrices by using the chain rule.
Since the concept of a “layer” does not appear in this derivation, we will use \(X\) and \(Y\) to denote the matrices and \(x_{ij}\) and \(y_{kl}\) to denote their entries, without the layer superscripts.
Assume that we have \(f(Y):\mathbb{R}^{m\times n}\to\mathbb{R}\) and a linear transformation \(g(X):\mathbb{R}^{p\times n}\to \mathbb{R}^{m\times n}, Y=g(X)=AX+B\), where \(A\in\mathbb{R}^{m\times p}, B\in\mathbb{R}^{m\times n}\). We can compute the derivatives of \(f\) with respect to \(X\) as follows.
We know that, at a point \(x\), if there are two intermediate variables \(u=\phi(x)\) and \(v=\psi(x)\) whose partial derivatives with respect to \(x\) are defined, then the composite function \(f(u,v)\) has a partial derivative with respect to \(x\), computed as \(\frac{\partial f}{\partial x}=\frac{\partial f}{\partial u}\frac{\partial u}{\partial x}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial x}\). In our case there may be several intermediate variables \(y_{kl}\), hence we have \(\frac{\partial f}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}\) (1).
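As a quick sanity check of the chain rule in (1) for the scalar case, the following sketch (a hypothetical example, not part of the derivation) compares the chain-rule total derivative of \(f(u,v)=uv\) with \(u=\phi(x)=x^2\) and \(v=\psi(x)=\sin x\) against a finite-difference approximation; the specific functions are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical example: f(u, v) = u * v with u = x**2 and v = sin(x).
phi = lambda x: x ** 2          # u = phi(x)
psi = lambda x: np.sin(x)       # v = psi(x)
f = lambda u, v: u * v          # composite target f(u, v)

x = 0.7
u, v = phi(x), psi(x)

# Chain rule: df/dx = (df/du)(du/dx) + (df/dv)(dv/dx)
df_du, df_dv = v, u
du_dx, dv_dx = 2 * x, np.cos(x)
chain_rule = df_du * du_dx + df_dv * dv_dx

# Central finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (f(phi(x + eps), psi(x + eps)) - f(phi(x - eps), psi(x - eps))) / (2 * eps)

print(chain_rule, numeric)      # the two values should agree closely
```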
Let \(a_{ij}\) and \(b_{ij}\) denote the elements of \(A\) and \(B\); then \(y_{kl}=\sum_{s}a_{ks}x_{sl}+b_{kl}\). Hence we have \(\frac{\partial y_{kl}}{\partial x_{ij}}=\frac{\partial \sum_{s}a_{ks}x_{sl}}{\partial x_{ij}}=\frac{\partial a_{ki}x_{il}}{\partial x_{ij}}=a_{ki}\delta_{lj}\) (2). Here \(\delta_{lj}\) is defined as \(\delta_{lj}=1\) when \(l=j\) and \(\delta_{lj}=0\) otherwise. Intuitively, for the single pair \((x_{ij}, y_{kl})\) there is the relation \(y_{kl}=a_{ki}x_{il}+b_{kl}\) plus terms that do not involve \(x_{ij}\). Hence the derivative of \(y_{kl}\) with respect to \(x_{ij}\) is \(a_{ki}\) only when \(l=j\); otherwise, the derivative is \(0\).
Substituting (2) into (1), we have \(\frac{\partial f}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}a_{ki}\delta_{lj}=\sum_{k}\frac{\partial f}{\partial y_{kj}}a_{ki}\) (because only the terms with \(l=j\) survive). In this equation, the entries \(a_{ki}\) for \(k=1,\dots,m\) form the \(i\)th row of \(A^T\) (equivalently, the \(i\)th column of \(A\)), and \(\frac{\partial f}{\partial y_{kj}}\) is the \((k,j)\) element of the gradient of \(f\) with respect to \(Y\). In summary, this equation tells us that the derivative of \(f\) with respect to \(x_{ij}\) is the dot product of the \(i\)th row of \(A^T\) and the \(j\)th column of \(\nabla_Yf\).
Now that we know \(\frac{\partial f}{\partial x_{ij}}\) is the dot product of the \(i\)th row of \(A^T\) and the \(j\)th column of \(\nabla_Yf\), for \(\frac{\partial f}{\partial X}\) we have
\[\begin{split}\frac{\partial f}{\partial X}=\left[ {\begin{array}{*{20}c} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\ \vdots & \frac{\partial f}{\partial x_{ij}} & \vdots \\ \frac{\partial f}{\partial x_{p1}} & \cdots & \frac{\partial f}{\partial x_{pn}} \end{array} } \right]=A^T\nabla_Y f\end{split}\] because every element of \(\frac{\partial f}{\partial X}\) equals the inner product of a row of \(A^T\) and a column of \(\nabla_Y f\).
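The following sketch numerically checks \(\nabla_X f=A^T\nabla_Y f\). The choice \(f(Y)=\frac{1}{2}\sum_{kl}y_{kl}^2\), whose gradient is simply \(\nabla_Y f=Y\), and the matrix shapes are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 3, 4, 5                        # assumed shapes for illustration
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X = rng.standard_normal((p, n))

f = lambda Y: 0.5 * np.sum(Y ** 2)       # example scalar function; grad_Y f = Y

Y = A @ X + B
analytic = A.T @ Y                       # grad_X f = A^T grad_Y f

# Central finite-difference gradient with respect to X, entry by entry.
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(p):
    for j in range(n):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (f(A @ Xp + B) - f(A @ Xm + B)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # expected: True
```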
It is also common for \(g(X)\) to be defined as \(g(X):\mathbb{R}^{m\times p}\to \mathbb{R}^{m\times n}, Y=g(X)=XC+D\), where \(C\in\mathbb{R}^{p\times n}\) and \(D\in\mathbb{R}^{m\times n}\). In this case, \(f(Y):\mathbb{R}^{m\times n}\to\mathbb{R}\) is defined as well. To compute \(\frac{\partial f}{\partial X}\), we first consider \(Y^T\): we have \(Y^T=(XC+D)^T=C^TX^T+D^T\). Then, by the law we found above, we immediately know that \(\nabla_{X^T}f=(C^T)^T\nabla_{Y^T} f=C\nabla_{Y^T} f\). Therefore, noting that \(\nabla_{Y^T} f=(\nabla_{Y} f)^T\), we have \(\nabla_X f = (\nabla_{X^T}f)^T=(C\nabla_{Y^T} f)^T=(\nabla_{Y^T} f)^TC^T=(\nabla_{Y} f)C^T\).
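An analogous sketch checks \(\nabla_X f=(\nabla_Y f)C^T\) for the right-multiplication case, again using the assumed loss \(f(Y)=\frac{1}{2}\sum_{kl}y_{kl}^2\) and illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 3, 4, 5                        # assumed shapes for illustration
X = rng.standard_normal((m, p))
C = rng.standard_normal((p, n))
D = rng.standard_normal((m, n))

f = lambda Y: 0.5 * np.sum(Y ** 2)       # example scalar function; grad_Y f = Y

Y = X @ C + D
analytic = Y @ C.T                       # grad_X f = (grad_Y f) C^T

# Central finite-difference gradient with respect to X.
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(m):
    for j in range(p):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        numeric[i, j] = (f(Xp @ C + D) - f(Xm @ C + D)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # expected: True
```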
In summary, if we have two functions, \(f(Y)\), which takes a matrix and returns a scalar, and a linear transformation \(g(X)\), then we can perform the differentiation using the chain rule. More specifically, we have found two laws:
If the linear transformation is defined as \(g(X)=AX+B\) (we call this left multiplication), then we have \(\nabla_X f=A^T\nabla_Y f\). (Law 1)
If the linear transformation is defined as \(g(X)=XC+D\) (we call this right multiplication), then we have \(\nabla_X f=(\nabla_Y f)C^T\). (Law 2)
Note: We should be careful about the independent variable. Here we are computing the gradient of the function \(f\) with respect to \(X\); we can also compute the gradient of \(f\) with respect to \(C\). In that case, we should use the other law: viewing \(Y=XC+D\) as a left multiplication of \(C\) by \(X\), Law 1 gives \(\nabla_C f=X^T\nabla_Y f\).
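To illustrate the note, the sketch below checks that for \(Y=XC+D\), treating \(X\) as the left factor and applying Law 1 gives \(\nabla_C f=X^T\nabla_Y f\); the loss and shapes are again assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 3, 4, 5                        # assumed shapes for illustration
X = rng.standard_normal((m, p))
C = rng.standard_normal((p, n))
D = rng.standard_normal((m, n))

f = lambda Y: 0.5 * np.sum(Y ** 2)       # example scalar function; grad_Y f = Y

Y = X @ C + D
analytic = X.T @ Y                       # Law 1 applied to C: grad_C f = X^T grad_Y f

# Central finite-difference gradient with respect to C.
eps = 1e-6
numeric = np.zeros_like(C)
for i in range(p):
    for j in range(n):
        Cp, Cm = C.copy(), C.copy()
        Cp[i, j] += eps
        Cm[i, j] -= eps
        numeric[i, j] = (f(X @ Cp + D) - f(X @ Cm + D)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # expected: True
```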
These two laws are the most important and fundamental conclusions in this work. We will find that the essential components of a convolutional neural network (fully connected layers, convolution layers, etc.) are linear transformations, and the loss function plays the role of the \(f(Y)\) we have here. As a result, we will show how to express those components as linear transformations, and these transformations form the main line of Section 3, Approach.