import tensorflow as tf

from liner_regression.fashion_mnist_dataset import *
# read data
batch_size = 256
train_data_iter, test_data_iter = load_data_fashion_mnist(batch_size)
# initialize model parameters
# each image in the dataset is a 28 * 28 image; in this section we flatten each image
# and treat it as a vector of length 784,
# so X's size is 256 * 784, W's size is 784 * 10, b is a vector of length 10
# (broadcast across all 256 rows), and y's size is 256 * 10 (y = softmax(XW + b))
num_inputs = 28 * 28
num_outputs = 10
W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs), mean=0, stddev=0.01))
b = tf.Variable(tf.zeros(num_outputs))
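# quick sanity check (illustrative) of the parameter shapes described above
print(W.shape, b.shape)  # (784, 10) (10,)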
# define the softmax operation
def softmax(linear_result: tf.Tensor):
    # if linear_result is an n * m matrix,
    # exped is an n * m matrix
    exped = tf.exp(linear_result)
    # sum_of_each_line is an n * 1 matrix; with keepdims=False it would be a
    # vector of length n, which would not broadcast row-wise as we need
    sum_of_each_line = tf.reduce_sum(exped, 1, keepdims=True)
    return exped / sum_of_each_line
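# the training loop below calls net(x, W, b), which this section never defines;
# here is a minimal sketch consistent with the comment above (y = softmax(XW + b)),
# assuming x arrives already flattened to n * 784
def net(x, W, b):
    return softmax(tf.matmul(x, W) + b)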
# define the loss: cross-entropy
def cross_entropy(predicted_y, label_y):
    # predicted_y is an n * m matrix and label_y is a vector of length n holding class indices
    # in this example, predicted_y is 256 * 10 and label_y has length 256
    # tf.one_hot marks each row's true class, and tf.boolean_mask picks out the
    # predicted probability of that class
    return -tf.math.log(tf.boolean_mask(predicted_y, tf.one_hot(label_y, depth=predicted_y.shape[-1])))
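# worked example (illustrative): with 3 classes and labels [2, 0], tf.one_hot
# marks entries (0, 2) and (1, 0), tf.boolean_mask picks out 0.6 and 0.3,
# and the loss is [-log(0.6), -log(0.3)] ~= [0.51, 1.20]
demo_pred = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
demo_labels = tf.constant([2, 0])
print(cross_entropy(demo_pred, demo_labels))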
# define the optimizer
def stochastic_gradient_descent(params, gradients, batch_size, learning_rate: float):
    # because our loss is calculated as a sum over the mini-batch of examples,
    # we normalize the step size by the batch size (batch_size),
    # so that the magnitude of a typical step does not depend heavily on our choice of batch size
    for param, grad in zip(params, gradients):
        param.assign_sub(grad * learning_rate / batch_size)
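# quick check (illustrative) of the update rule: one scalar parameter, gradient
# 2.0 summed over a batch of 4, learning rate 0.1 -> step of 0.1 * 2.0 / 4 = 0.05
demo_param = tf.Variable(1.0)
stochastic_gradient_descent([demo_param], [tf.constant(2.0)], batch_size=4, learning_rate=0.1)
print(demo_param.numpy())  # 0.95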
# classification accuracy
def accuracy(predicted_y, label_y):
    # predicted_y is an n * m matrix and label_y is a vector of length n
    # in this example, predicted_y is 256 * 10 and label_y has length 256
    # tf.argmax returns the index with the largest value along the given axis of a tensor
    predicted_y = tf.argmax(predicted_y, axis=1)
    # cmp is a length-n boolean tensor
    cmp = tf.cast(predicted_y, label_y.dtype) == label_y
    # return the number of correct predictions and the total number of predictions
    return tf.reduce_sum(tf.cast(cmp, label_y.dtype)), label_y.shape[0]
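# small example (illustrative): rows 0 and 1 predict the correct class, row 2 does not
demo_scores = tf.constant([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]])
demo_targets = tf.constant([1, 0, 0])
num_right, num_total = accuracy(demo_scores, demo_targets)
print(int(num_right), num_total)  # 2 3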
# training
def train():
    for _ in range(3):
        num_right_predictions = 0
        num_total_predictions = 0
        for x, y in train_data_iter:
            with tf.GradientTape() as g:
                x = tf.reshape(x, shape=(x.shape[0], -1))
                y_hat = net(x, W, b)
                l = cross_entropy(y_hat, y)
            # the loss is a vector of per-example losses; sum it before
            # differentiating, matching the batch-size normalization in the optimizer
            grads = g.gradient(tf.reduce_sum(l), [W, b])
            # learning rate 0.1 is an assumed hyperparameter, not given in this section
            stochastic_gradient_descent([W, b], grads, x.shape[0], learning_rate=0.1)
            right, total = accuracy(y_hat, y)
            num_right_predictions += right
            num_total_predictions += total
        print('train accuracy:', float(num_right_predictions) / num_total_predictions)
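# entry point: run the three training epochs defined above
if __name__ == '__main__':
    train()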