Seminar: The Unexpected Effects of Data Augmentation and Some Analyses by Gaussian Universality
Abstract
A staple of modern-day machine learning, data augmentation (DA) refers to a range of heuristics that synthetically enlarge a data set by random transformations: a prototypical example is the inclusion of randomly rotated versions of the same image in the training data set for image classification. Common motivations for data augmentation include high dimensionality of the data relative to the sample size, known geometric structures in the data, and a desire for robustness and stability of learning algorithms. Empirically, despite many successes, it remains unclear to what extent different DA techniques help in different applications, and whether the computational budget for DA should instead be allocated to, e.g., scaling the network. Theoretically, there is not yet a principled way to understand the effects of DA, since the dependence it introduces violates the commonly used i.i.d. assumption. In this talk, I shall present several analyses of DA via (Gaussian) universality, a probabilistic tool that has traditionally been used in distributional approximations and random matrix theory, but has gained much popularity in machine learning theory in recent years. We will see how universality enables the analysis of DA in setups beyond dependence and geometric invariances, and yields a nuanced understanding in different settings. These include the double-descent risk curve of high-dimensional interpolators, the classification risk of logistic regression, and, if time permits, the training of a large-scale neural network in an AI-for-physics application. The talk will touch on aspects of the following papers: https://arxiv.org/abs/2202.09134 (under journal review), https://arxiv.org/abs/2502.15752 (COLT 2025) and https://arxiv.org/abs/2502.05318 (ICML 2025).
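To make the prototypical example concrete, here is a minimal sketch of rotation-based augmentation for image classification. All names (`augment_with_rotations`, the toy arrays) are illustrative, not from the talk; rotations are restricted to random multiples of 90 degrees via `np.rot90` so the sketch needs only NumPy and no interpolation.

```python
import numpy as np

def augment_with_rotations(images, labels, seed=None):
    """Enlarge a dataset by appending a randomly rotated copy of each image.

    Each copy is rotated by a random multiple of 90 degrees (np.rot90),
    and inherits the label of its source image. Note the dependence this
    introduces: each augmented sample is a deterministic transform of an
    original one, so the enlarged dataset is no longer i.i.d.
    """
    rng = np.random.default_rng(seed)
    rotated = np.stack([np.rot90(img, k=rng.integers(1, 4)) for img in images])
    return np.concatenate([images, rotated]), np.concatenate([labels, labels])

# Toy example: 5 "images" of shape 8x8 with binary labels.
X = np.arange(5 * 8 * 8, dtype=float).reshape(5, 8, 8)
y = np.array([0, 1, 0, 1, 1])
X_aug, y_aug = augment_with_rotations(X, y, seed=0)
print(X_aug.shape, y_aug.shape)  # (10, 8, 8) (10,)
```

The doubled-but-dependent sample produced here is exactly the kind of object the i.i.d. assumption fails to cover, which is what motivates the universality-based analyses in the talk.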
Biography
Kevin is a final-year PhD candidate in machine learning at the Gatsby Unit at University College London, advised jointly by Peter Orbanz at Gatsby and by Morgane Austern at Harvard Statistics. He is starting as a ProbAI postdoctoral researcher working with Gareth Roberts at Warwick Statistics and Boris Hanin at Princeton Operations Research & Financial Engineering. Kevin’s research interests sit at the intersection of machine learning theory, probability and computational statistics, with a particular focus on the theoretical behaviours of large-scale stochastic systems that emerge in ML settings.