

Norms and Weight Decay

We can always mitigate overfitting by collecting more training data. However, that can be costly, time-consuming, or entirely out of our control, making it impossible in the short run.

Rather than directly manipulating the number of parameters, weight decay operates by restricting the values that the parameters can take. Weight decay might be the most widely used technique for regularizing parametric machine learning models.
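
Concretely, weight decay adds a penalty on the squared $\ell_2$ norm of the weights to the training loss. The standard penalized objective (written here with $L$ for the original loss and $\lambda \ge 0$ for the regularization strength) is

$$L_{\text{reg}}(\mathbf{w}, b) = L(\mathbf{w}, b) + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2.$$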

The update for $\mathbf{w}$ in stochastic gradient descent with this penalized loss is then

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \nabla_{\mathbf{w}} L(\mathbf{w}, b) + \lambda \mathbf{w} \right) = (1 - \eta \lambda)\, \mathbf{w} - \eta \nabla_{\mathbf{w}} L(\mathbf{w}, b),$$

where $\eta$ is the learning rate.

At every iteration, we first shrink $\mathbf{w}$ by the factor $(1 - \eta \lambda)$, i.e., we decay it by $\eta \times \lambda$ before applying the gradient step; that is why this method is called weight decay.
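
As a minimal sketch (the names `w`, `grad`, `lr`, and `lam` are illustrative, not from the original notes), one SGD step with weight decay can be written as:

```python
import numpy as np

def sgd_step_with_decay(w, grad, lr, lam):
    """One SGD step on the L2-penalized loss.

    Equivalent forms:
        w <- w - lr * (grad + lam * w)
        w <- (1 - lr * lam) * w - lr * grad
    The parameters are shrunk by the factor (1 - lr * lam) before the
    gradient step, hence the name "weight decay".
    """
    return (1 - lr * lam) * w - lr * grad

# Tiny usage example with made-up numbers.
w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.2, -0.1, 0.3])
w = sgd_step_with_decay(w, grad, lr=0.1, lam=0.01)
```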

Often, we do not regularize the bias term.
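
In a framework such as PyTorch, this is commonly done with parameter groups, as in the sketch below (the layer sizes and hyperparameters are illustrative):

```python
import torch
from torch import nn

net = nn.Linear(in_features=20, out_features=1)

# Apply weight decay to the weights only, not to the bias.
optimizer = torch.optim.SGD(
    [
        {"params": [net.weight], "weight_decay": 1e-3},
        {"params": [net.bias], "weight_decay": 0.0},
    ],
    lr=0.1,
)
```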

Why Weight Decay Can Mitigate Overfitting to Some Extent

In practice, the data used for machine learning is noisy. This noise keeps the model from learning the truly optimal solution: the learned solution generally deviates from the optimum, and one can show that the larger the noise, the larger the learned $w$ becomes. Weight decay lets us adjust $\lambda$ to pull the learned solution back toward the true optimum. If $\lambda$ is too small, the shrunken solution still remains some distance from the true optimum; if $\lambda$ is too large, the solution overshoots and moves away from the optimum in the other direction.
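
The toy experiment below sketches this intuition; it is not from the original notes, and all settings (sizes, noise scale, the value of `lam`) are made up. It fits noisy linear data with and without an L2 penalty and prints how far each solution lies from the true weights, so you can vary `lam` and watch the trade-off described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=1.0, size=n)  # noisy observations

# Ordinary least squares (no regularization).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: closed-form minimizer of the L2-penalized squared loss.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("distance to w_true (no decay):  ", np.linalg.norm(w_ols - w_true))
print("distance to w_true (with decay):", np.linalg.norm(w_ridge - w_true))
```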