# Softmax

Softmax is a function that takes the output of the fully connected layers and turns it into probabilities. Formally, it takes an $$n$$-dimensional vector and normalizes it into $$n$$ probabilities proportional to the exponentials of the input values. It is defined as

$f(x_i)=\frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}$

where $$x_i$$ is the $$i$$-th input value.
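To make the definition concrete, here is a minimal sketch that applies it directly to a small vector. It assumes plain NumPy rather than tinyml's backend, and the `softmax` helper is only an illustration, not part of the library.

```python
import numpy as np


def softmax(x):
    """Softmax of a 1-d array, following the definition term by term."""
    exps = np.exp(x)          # e^{x_i} for every input value
    return exps / exps.sum()  # divide by the sum of all exponentials


x = np.array([1.0, 2.0, 3.0])
probs = softmax(x)
print(probs)        # approximately [0.090, 0.245, 0.665]
print(probs.sum())  # the outputs sum to 1 (up to rounding)
```

Larger inputs receive larger probabilities, and the outputs always sum to one, which is what lets them be read as a probability distribution.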

We can then compute the derivative using the quotient rule: if $$f(x)=\frac{g(x)}{h(x)}$$, then $$f'(x)=\frac{g'(x)h(x)-g(x)h'(x)}{h^2(x)}$$. In our case, $$g_i=e^{x_i}$$ and $$h=\sum_k e^{x_k}$$. Then $$\frac{\partial g_i}{\partial x_j}=e^{x_i}$$ if $$i=j$$ and $$0$$ if $$i\neq j$$. For $$h$$, regardless of the relation between $$i$$ and $$j$$, the derivative $$\frac{\partial h}{\partial x_j}$$ is always $$e^{x_j}$$.

Thus we have:

When $$i=j$$,

$\frac{\partial f(x_i)}{\partial x_j}=\frac{e^{x_i}\sum_k e^{x_k}-e^{x_i}e^{x_j}}{(\sum_k e^{x_k})^2}=\frac{e^{x_i}}{\sum_k e^{x_k}}\times \frac{\sum_k e^{x_k} - e^{x_j}}{\sum_k e^{x_k}} = f(x_i)(1-f(x_j)) = f(x_i)(1-f(x_i))$

When $$i\neq j$$,

$\frac{\partial f(x_i)}{\partial x_j}=\frac{0-e^{x_i}e^{x_j}}{(\sum_k e^{x_k})^2}=-\frac{e^{x_i}}{\sum_k e^{x_k}}\times \frac{e^{x_j}}{\sum_k e^{x_k}}=-f(x_i)f(x_j)$
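The two cases can be combined into a single Jacobian matrix, with entries $$f(x_i)(1-f(x_i))$$ on the diagonal and $$-f(x_i)f(x_j)$$ elsewhere. The sketch below builds that matrix and compares it with a central finite-difference estimate. It assumes plain NumPy, and `softmax_jacobian` and `numerical_jacobian` are hypothetical helper names used only for this check.

```python
import numpy as np


def softmax(x):
    exps = np.exp(x)
    return exps / exps.sum()


def softmax_jacobian(x):
    """J[i, j] = d f(x_i) / d x_j, built from the two cases of the derivation."""
    f = softmax(x)
    # Diagonal: f_i - f_i * f_i = f_i * (1 - f_i); off-diagonal: -f_i * f_j.
    return np.diag(f) - np.outer(f, f)


def numerical_jacobian(x, h=1e-6):
    """Central finite-difference estimate of the same matrix."""
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        jac[:, j] = (softmax(x + step) - softmax(x - step)) / (2 * h)
    return jac


x = np.array([0.5, -1.0, 2.0])
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(x)
print(np.allclose(analytic, numeric, atol=1e-8))
```

Each row of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to one: raising one probability must lower the others by the same total amount.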

The implementation of the softmax layer in tinyml is shown below:

```python
from tinyml.core import Backend as np

from .base import Layer


class Softmax(Layer):
    '''
    Softmax layer returns probabilities proportional to the exponentials
    of the input values.
    '''
    def __init__(self, name='softmax', axis=1, eps=1e-10):
        super().__init__(name)
        self.type = 'Softmax'
        self.axis = axis
        self.eps = eps

    def forward(self, input):
        '''
        Some computational stability tricks here.

        > TODO: to add the tricks
        '''
        self.input = input
        # Subtract the maximum along the axis before exponentiating: the
        # result is unchanged, but the exponential can no longer overflow.
        shifted = np.exp(input - input.max(axis=self.axis, keepdims=True))
        result = shifted / shifted.sum(axis=self.axis, keepdims=True)
        return result

    def backward(self, in_gradient):
        '''
        Important: The actual backward gradient is not :math:`1`.

        The reason why we pass the gradient directly to the previous layer
        is: since the formula is pretty straightforward when softmax is used
        together with the cross entropy loss (see the theoretical
        derivation), we compute the gradient in the cross entropy loss
        function, which reduces the complexity and improves the numerical
        stability.
        '''
        return in_gradient
```
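The `backward` docstring relies on the fact that softmax followed by cross entropy has a simple combined gradient: the predicted probabilities minus the one-hot targets. The sketch below checks this numerically and also shows that the max-subtraction in `forward` keeps large inputs finite. It assumes plain NumPy, one-hot targets, and a loss averaged over the batch; `cross_entropy`, `cross_entropy_grad`, and `numerical_grad` are hypothetical helpers and may not match how tinyml's own loss is written.

```python
import numpy as np


def softmax(logits):
    # Same shift as in the layer: subtract the row-wise maximum first.
    shifted = np.exp(logits - logits.max(axis=1, keepdims=True))
    return shifted / shifted.sum(axis=1, keepdims=True)


def cross_entropy(logits, targets):
    """Mean cross entropy for one-hot targets (assumed loss definition)."""
    probs = softmax(logits)
    return -np.mean(np.sum(targets * np.log(probs), axis=1))


def cross_entropy_grad(logits, targets):
    """Combined softmax + cross entropy gradient w.r.t. the logits."""
    return (softmax(logits) - targets) / logits.shape[0]


def numerical_grad(logits, targets, h=1e-5):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        step = np.zeros_like(logits)
        step[idx] = h
        grad[idx] = (cross_entropy(logits + step, targets)
                     - cross_entropy(logits - step, targets)) / (2 * h)
    return grad


# The second row would overflow np.exp without the shift.
logits = np.array([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]])
targets = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

probs = softmax(logits)
print(np.all(np.isfinite(probs)))
print(np.allclose(cross_entropy_grad(logits, targets),
                  numerical_grad(logits, targets), atol=1e-6))
```

Because the loss already produces the gradient with respect to the softmax inputs, a layer that simply returns `in_gradient` from `backward` is consistent with this arrangement, provided it is only ever paired with such a loss.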