Introductory statistics courses teach us that, when fitting a model to some data, we should have more data points than free parameters to avoid the danger of overfitting—fitting the noise in the data too closely, and thereby failing to generalize to new data. It is surprising, then, that in modern deep learning the practice is to have orders of magnitude more parameters than data. Despite this, deep networks exhibit good predictive performance, and in fact do better the more parameters they have. Why would that be?
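The classical intuition can be illustrated with a toy experiment (a minimal sketch, not from the article: the setup, noise level, and polynomial degrees below are illustrative choices). Fitting a polynomial whose parameter count matches the number of noisy samples drives the training error to essentially zero, while the error on held-out points typically suffers compared to a smaller model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 10 noisy samples of a simple underlying function.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

# Held-out points from the same (noise-free) underlying function.
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_and_errors(degree):
    # A degree-d polynomial has d + 1 free parameters.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (3, 9):
    train_mse, test_mse = fit_and_errors(d)
    print(f"degree {d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-9 fit (10 parameters for 10 points) interpolates the training data, noise included. The puzzle the article raises is why massively overparameterized deep networks do not behave like this toy polynomial.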
Why deep networks generalize despite going against statistical intuition