
multilayer_perceptrons

Limitation of Linear Models

Linearity implies the weaker assumption of monotonicity, i.e., that any increase in our feature must either always cause an increase in our model's output (if the corresponding weight is positive), or always cause a decrease in our model's output (if the corresponding weight is negative).

Hidden Layers

In many learning tasks, the relationship between a sample's features and its output label is not linear, so a single-layer network can only handle a limited range of problems. We can increase the model's capacity by adding one or more hidden layers between the input layer and the output layer. However, simply stacking hidden layers still yields a linear model: each pair of adjacent layers computes an affine transformation, and composing one affine transformation with another gives yet another affine transformation. To turn a linear network into a nonlinear one, we therefore need to introduce activation functions. An activation function decides whether a neuron's output should continue into the subsequent computation, and it is usually a nonlinear function.
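A minimal sketch of the affine-composition argument (using NumPy; the layer sizes are arbitrary and chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # a batch of 4 inputs with 3 features each

# Two stacked affine layers with no activation in between.
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping collapses into a single affine layer:
# W = W1 W2 and b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))   # True: no extra expressive power

# Inserting a nonlinearity (here ReLU) between the layers breaks the collapse.
hidden = np.maximum(x @ W1 + b1, 0)
mlp_out = hidden @ W2 + b2
```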

Common Activation Functions

As noted above, an activation function decides whether a neuron's output participates in the subsequent computation, and it is usually nonlinear. The most commonly used activation functions are described below.

ReLU Function

The ReLU activation function is the most popular choice, because it is simple to implement and networks that use it perform well in practice. It is defined as:

ReLU(x) = max(x, 0)

In other words, ReLU keeps only the positive activations and passes them through unchanged, while all negative activations are discarded (set to zero).

As for the derivative of ReLU: when the input is less than 0 the gradient is 0, and when the input is greater than 0 the gradient is 1. At exactly 0 the function is not differentiable; the usual engineering convention is to take the gradient there to be 0.
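A small sketch, assuming PyTorch is available, that checks this gradient convention through autograd:

```python
import torch

x = torch.arange(-2.0, 2.5, 0.5, requires_grad=True)
y = torch.relu(x)           # ReLU(x) = max(x, 0)
y.sum().backward()          # gradient of sum(ReLU(x)) with respect to x
print(x.grad)               # 0 for x < 0, 1 for x > 0, and 0 at x == 0
```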

pReLU Function

The pReLU (parameterized ReLU) function is a variant of ReLU that still lets a limited amount of information from negative inputs pass through:

pReLU(x) = max(0, x) + α · min(0, x)

where α is a learnable parameter.
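A brief sketch, assuming PyTorch, using its built-in PReLU module (which initializes α to 0.25 by default):

```python
import torch
from torch import nn

prelu = nn.PReLU()                               # one learnable alpha, initialized to 0.25
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(x))                                  # negatives scaled by alpha, positives unchanged
```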

Sigmoid Function

In logistic-regression we already used the sigmoid function to transform the output; the sigmoid can also serve as an activation function inside a neural network.

The sigmoid function maps any real-valued input into the interval (0, 1), which is why it is also called a squashing function:

sigmoid(x) = 1 / (1 + exp(-x))

The sigmoid is a smooth function that is differentiable to any order; its first derivative is largest at x = 0 (where it equals 0.25) and decreases toward 0 as the input moves away from 0 in either direction.

The sigmoid is commonly used as the output-layer activation in binary classification networks, because its output can be interpreted as a probability.

However, the sigmoid has largely been replaced by the simpler and more easily trainable ReLU for most use in hidden layers. Much of this has to do with the fact that the sigmoid poses challenges for optimization since its gradient vanishes for large positive and negative arguments.
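A short sketch, assuming PyTorch, that illustrates both the squashing behavior and the vanishing gradient for large positive and negative inputs:

```python
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()
print(y)        # all values squashed into (0, 1)
print(x.grad)   # sigma(x) * (1 - sigma(x)): ~0 at |x| = 10, maximum 0.25 at x = 0
```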

Tanh Function

Like the sigmoid function, tanh (the hyperbolic tangent) squashes real-valued inputs into outputs in the interval (-1, 1):

tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))

The graphs of tanh and of its first derivative look very similar to those of the sigmoid.
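For comparison with the sigmoid example above, a corresponding sketch (again assuming PyTorch):

```python
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()
print(y)        # outputs squashed into (-1, 1)
print(x.grad)   # 1 - tanh(x)**2: maximum of 1 at x = 0, near 0 for large |x|
```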

Universal Approximators

Related results suggest that even with a single-hidden-layer network, given enough nodes (possibly absurdly many), and the right set of weights, we can model any function. Actually learning that function is the hard part, though. You might think of your neural network as being a bit like the C programming language. The language, like any other modern language, is capable of expressing any computable program. But actually coming up with a program that meets your specifications is the hard part.

Moreover, just because a single-hidden-layer network can learn any function does not mean that you should try to solve all of your problems with one. In fact, we can approximate many functions much more compactly by using deeper (rather than wider) networks.