What happens if someone runs stochastic gradient descent with batch size one on a simple linear regression problem? Does the estimator converge to the OLS estimator, which uses the full data?
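As a minimal sketch of the question (with synthetic data I generate here, not the post's actual example): batch-size-one SGD with a Robbins–Monro step-size decay on a squared-error loss should track the closed-form OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form OLS estimator on the full data.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# SGD with batch size one and a decaying (Robbins-Monro) step size.
beta = np.zeros(d)
for t in range(1, 200001):
    i = rng.integers(n)
    grad = (X[i] @ beta - y[i]) * X[i]   # gradient of 0.5*(x_i'b - y_i)^2
    beta -= (1.0 / (10 + t)) * grad      # step size ~ 1/t

print(np.max(np.abs(beta - beta_ols)))  # small: SGD tracks the OLS estimator
```

The decaying step size is what makes the stochastic iterates settle on the empirical-loss minimizer; with a constant step size they would instead hover in a noise ball around it.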
Using deep neural nets to solve an optimal control problem: an example from Recursive Macroeconomic Theory by Lars Ljungqvist and Tom Sargent, 3rd edition (Exercise 5.11).
Stochastic gradient descent is biased toward a specific class of solutions.
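One concrete instance of this bias (a standard illustration, not necessarily the post's own example): in an overparameterized linear regression with many interpolating solutions, batch-size-one SGD initialized at zero converges to the minimum-Euclidean-norm interpolant, since every update lies in the row space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100            # more parameters than data points: many interpolants
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-Euclidean-norm interpolant (Moore-Penrose pseudoinverse).
beta_min_norm = np.linalg.pinv(X) @ y

# SGD (batch size one) from the zero initialization: every update is a
# multiple of some row x_i, so the iterate stays in the row space of X
# and converges to the minimum-norm solution among all interpolants.
beta = np.zeros(d)
for t in range(100000):
    i = rng.integers(n)
    beta -= 0.01 * (X[i] @ beta - y[i]) * X[i]

print(np.max(np.abs(beta - beta_min_norm)))  # essentially zero
```

The "specific class of solutions" here is the minimum-norm one; which class SGD selects in general depends on the loss, the parameterization, and the initialization.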
Symmetry, used as a priori information, can lead to substantial dimensionality reduction (alleviating the curse of dimensionality) and strong generalization power. I provide two examples: rotation invariance and permutation invariance.
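For the permutation case, a minimal sketch (with hypothetical random weights, in the spirit of a Deep-Sets architecture): embedding each set element, sum-pooling the embeddings, and then mapping the pooled vector to the output makes the model invariant to the ordering of its inputs by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weights for a tiny set function f(X) = rho(sum_i phi(x_i)).
W_phi = rng.normal(size=(4, 8))
W_rho = rng.normal(size=(8, 1))

def f(points):                      # points: (n_elements, 4) array, read as a set
    h = np.tanh(points @ W_phi)     # phi, applied to each element independently
    pooled = h.sum(axis=0)          # sum pooling: permutation-invariant step
    return float(pooled @ W_rho)    # rho, applied to the pooled representation

X = rng.normal(size=(5, 4))
perm = rng.permutation(5)
print(np.isclose(f(X), f(X[perm])))  # True: the set's ordering does not matter
```

Because the symmetry is built into the architecture, the network never has to spend capacity learning all n! orderings of its input, which is exactly the dimensionality reduction the post refers to.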
"It is of course impossible to even think the word Gaussian without immediately mentioning the most important property of Gaussian processes, that is concentration of measure." — M. Talagrand
The code is written as a gift for a great teacher and mentor