Softmax

Softmax is a function that takes the output of the fully connected layers and turns it into a probability distribution. Formally, it takes an \(n\)-dimensional vector and normalizes it into \(n\) probabilities proportional to the exponentials of the inputs. It is defined as

\[f(x_i)=\frac{e^{x_i}}{\sum_k e^{x_k}}\]

where \(x_i\) is the \(i\)-th input number.

We can then compute the derivative using the quotient rule (if \(f(x)=\frac{g(x)}{h(x)}\), then \(f'(x)=\frac{g'(x)h(x)-g(x)h'(x)}{h^2(x)}\)). In our case, we have \(g_i=e^{x_i}\) and \(h_i=\sum_k e^{x_k}\). Then \(\frac{\partial g_i}{\partial x_j}=e^{x_i}\) when \(i=j\), and \(0\) when \(i\neq j\). For \(h_i\), regardless of the relation between \(i\) and \(j\), the derivative is always \(e^{x_j}\).

Thus we have:

When \(i=j\),

\[\frac{\partial f_i}{\partial x_j}=\frac{e^{x_i}\sum_k e^{x_k}-e^{x_i}e^{x_i}}{(\sum_k e^{x_k})^2}=\frac{e^{x_i}}{\sum_k e^{x_k}}\times \frac{\sum_k e^{x_k} - e^{x_i}}{\sum_k e^{x_k}} = f(x_i)(1-f(x_i))\]

When \(i\neq j\),

\[\frac{\partial f_i}{\partial x_j}=\frac{0-e^{x_i}e^{x_j}}{(\sum_k e^{x_k})^2}=-\frac{e^{x_i}}{\sum_k e^{x_k}}\times \frac{e^{x_j}}{\sum_k e^{x_k}}=-f(x_i)f(x_j)\]
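The two cases above combine into the Jacobian \(J = \mathrm{diag}(f) - f f^{\top}\). As a sanity check on the derivation, here is a small standalone NumPy sketch (not part of Tinynet) that builds the Jacobian from the formula and compares it against a finite-difference approximation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = f(x_i)(1 - f(x_i)) when i == j, and -f(x_i) f(x_j) otherwise,
    # which is exactly diag(f) - f f^T.
    f = softmax(x)
    return np.diag(f) - np.outer(f, f)

x = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(x)

# Central finite-difference approximation of the same Jacobian.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    J_num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.abs(J - J_num).max())  # close to zero
```

Since the rows of the softmax output always sum to one, each row of the Jacobian sums to zero, which is another quick property to check.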

The implementation of the softmax layer in Tinynet is shown below:

from tinynet.core import Backend as np
from .base import Layer

class Softmax(Layer):
    '''
    Softmax layer returns the probability proportional to the exponentials of the input number.
    '''
    def __init__(self, name='softmax', axis=1, eps=1e-10):
        super().__init__(name)
        self.type = 'Softmax'
        self.axis = axis
        self.eps = eps

    def forward(self, input):
        '''
        For numerical stability, the maximum entry is subtracted from the
        input before exponentiating; this leaves the output unchanged but
        avoids overflow in the exponential.
        '''
        self.input = input
        shifted = np.exp(input - input.max(axis=self.axis, keepdims=True))
        result = shifted / shifted.sum(axis=self.axis, keepdims=True)
        return result

    def backward(self, in_gradient):
        '''
        Important: The actual backward gradient is not :math:`1`.

        The reason we pass the gradient through to the previous layer unchanged is this: when softmax is used together with the cross-entropy loss, the combined gradient has a very simple form (see the derivation above), so we compute it inside the cross-entropy loss instead. This reduces complexity and improves numerical stability.
        '''
        return in_gradient
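To see why the combined gradient is so simple, note that for a one-hot target \(y\) the gradient of the cross-entropy loss with respect to the logits reduces to \(p - y\), where \(p\) is the softmax output. The sketch below (standalone NumPy, not Tinynet's actual loss implementation) verifies this against a finite-difference check; the `cross_entropy` helper and its epsilon guard are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    # Same stable forward pass as the layer above (batch-first layout).
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    # y is one-hot; the small epsilon guards against log(0).
    return -np.sum(y * np.log(p + 1e-10)) / len(p)

# One sample, three classes; the true class is index 2.
z = np.array([[0.5, -1.0, 2.0]])
y = np.array([[0.0, 0.0, 1.0]])

p = softmax(z)
analytic = p - y  # combined softmax + cross-entropy gradient

# Finite-difference check of the loss w.r.t. the logits.
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.shape[1]):
    d = np.zeros_like(z)
    d[0, j] = eps
    numeric[0, j] = (cross_entropy(softmax(z + d), y)
                     - cross_entropy(softmax(z - d), y)) / (2 * eps)

print(np.abs(analytic - numeric).max())  # close to zero
```

Because the full Jacobian never needs to be materialized, the loss can hand \(p - y\) straight to `Softmax.backward`, which simply forwards it to the previous layer.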