Xiang Cheng (Berkeley) - Sampling as Optimization and Optimization as Sampling
This talk presents a series of results that draw connections between optimization and sampling. In one such result, we show that the Langevin SDE corresponds to the gradient flow of KL divergence with respect to the 2-Wasserstein metric in probability space. This allows us to prove convergence of Langevin MCMC in KL divergence, and even achieve accelerated rates in a similar fashion to Nesterov’s accelerated gradient descent. In the reverse direction, we can also show that Stochastic Gradient Descent may be viewed as the discretization of a certain Stochastic Differential Equation with a state-dependent diffusion matrix that corresponds to the covariance matrix of the sampled stochastic gradient. This theory helps us explain the behavior of SGD in settings such as the training of deep neural networks, where it has been observed that larger noise (in the form of smaller batch-size/larger step-size) gives smaller generalization error.
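As a concrete illustration of the first direction, the Langevin SDE dX_t = -∇f(X_t) dt + √2 dB_t can be discretized into the unadjusted Langevin algorithm, the basic form of Langevin MCMC referenced above. The sketch below is a minimal illustration, not the talk's analysis; the quadratic target f(x) = ||x||²/2 (a standard Gaussian) and all parameter values are assumptions chosen for simplicity.

```python
import numpy as np

# Minimal sketch of the unadjusted Langevin algorithm (ULA).
# Target density: pi(x) ~ exp(-f(x)) with f(x) = ||x||^2 / 2
# (standard Gaussian), so grad f(x) = x. This choice is illustrative.

def grad_f(x):
    return x  # gradient of f(x) = ||x||^2 / 2

def langevin_mcmc(x0, step_size, n_steps, rng):
    """Euler-Maruyama discretization of dX_t = -grad f(X_t) dt + sqrt(2) dB_t."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step_size * grad_f(x) + np.sqrt(2.0 * step_size) * noise
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
samples = langevin_mcmc(np.zeros(2), step_size=0.01, n_steps=50_000, rng=rng)
burned = samples[10_000:]  # discard burn-in before estimating moments
print(burned.mean(axis=0))  # close to 0
print(burned.var(axis=0))   # close to 1, up to discretization bias
```

For a fixed step size the chain converges only to a biased stationary distribution; the convergence-in-KL results mentioned in the abstract quantify how this bias and the mixing time scale with the step size.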
Based on joint work with Peter Bartlett, Niladri Chatterji, Michael Jordan, Yian Ma, and Dong Yin.
Host: Nati Srebro
Xiang Cheng
I am a graduate student in the EECS department at UC Berkeley, co-advised by Peter Bartlett and Michael Jordan. I am interested in the connections between optimization and sampling algorithms for machine learning. Recently, I have been trying to use insights from SDE theory to understand the statistical consequences of randomness in algorithms such as Stochastic Gradient Descent.
I am also interested in topics in online learning, bandits, reinforcement learning, and game theory.