Theory of Deep Learning
September 15 – December 12, 2020
Synopsis
Deep learning plays a central role in the recent revolution of artificial intelligence and data science. In a wide range of applications, such as computer vision, natural language processing, and robotics, deep learning achieves dramatic performance improvements over existing baselines, and in some cases even over humans. Despite the empirical success of deep learning, its theoretical foundations remain poorly understood, which hinders the development of more principled methods with performance guarantees. In particular, this lack of performance guarantees makes it challenging to incorporate deep learning into applications that involve decision making with critical consequences, such as healthcare and autonomous driving.
Towards a theoretical understanding of deep learning, many basic questions lack satisfying answers:
- The objective function for training a neural network is highly nonconvex. From an optimization perspective, why does stochastic gradient descent often converge to a desired solution in practice?
- The number of parameters of a neural network generally far exceeds the number of training data points (a situation known as overparametrization). From a statistical perspective, why does the learned neural network generalize to test data points, even though classical learning theory suggests serious overfitting?
- From an information-theoretic perspective, how can we characterize the form and/or the amount of information each hidden layer carries about the input and output of a deep neural network?
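To make the first question concrete, here is a toy sketch (illustrative only, not from the program materials): stochastic gradient descent on a simple nonconvex objective, which nonetheless reliably reaches a global minimum. All constants are our own choices.

```python
import numpy as np

# Toy illustration of the first question above: SGD on the nonconvex
# objective f(w) = (w^2 - 1)^2, which has global minima at w = +1 and
# w = -1 and a bad critical point at w = 0. Noisy gradient steps escape
# the flat region and settle at a global minimum.
rng = np.random.default_rng(0)
w = 0.1          # initialize near the bad critical point at w = 0
lr = 0.01        # step size
for _ in range(2000):
    grad = 4 * w * (w**2 - 1)          # exact gradient of f
    noise = rng.normal(scale=0.1)      # stand-in for minibatch noise
    w -= lr * (grad + noise)
print(abs(w))    # close to 1: a global minimum
```

Of course, one-dimensional examples like this say nothing about the high-dimensional loss landscapes of real networks; that gap is exactly what the question asks about.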
Organizers
Participation
Graduate Courses
T/Th 2:40–4:00pm, TTIC, Prof. Nathan Srebro
Streaming of lectures in the seminar room of IDEAL’s Fall 2020 Gather.town space
Upcoming Events
- October 13th, 4:00 pm Central: Seminar – Daniel Hsu (Columbia University)
Join us for the Livestream of Daniel Hsu’s Talk
Title: “Contrastive learning, multi-view redundancy, and linear models”
Abstract: Contrastive learning is a “self-supervised” approach to representation learning that uses naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. We study contrastive learning in the context of multi-view statistical models. First, we show that whenever the views of the data are approximately redundant in their ability to predict a target function, a low-dimensional embedding obtained via contrastive learning affords a linear predictor with near-optimal predictive accuracy. Second, we show that in the context of topic models, the embedding can be interpreted as a linear transformation of the posterior moments of the hidden topic distribution given the observed words. We also empirically demonstrate that linear classifiers with these representations perform well in document classification tasks with very few labeled examples in a semi-supervised setting.
This is joint work with Akshay Krishnamurthy (MSR) and Christopher Tosh (Columbia).
Bio: Daniel Hsu is an associate professor in the Department of Computer Science and a member of the Data Science Institute, both at Columbia University. Previously, he was a postdoc at Microsoft Research New England, and the Departments of Statistics at Rutgers University and the University of Pennsylvania. He holds a Ph.D. in Computer Science from UC San Diego, and a B.S. in Computer Science and Engineering from UC Berkeley. He was selected by IEEE Intelligent Systems as one of “AI’s 10 to Watch” in 2015 and received a 2016 Sloan Research Fellowship.
Daniel’s research interests are in algorithmic statistics and machine learning. His work has produced the first computationally efficient algorithms for several statistical estimation tasks (including many involving latent variable models such as mixture models, hidden Markov models, and topic models), provided new algorithmic frameworks for solving interactive machine learning problems, and led to the creation of scalable tools for machine learning applications.
His Ph.D. advisor at UCSD was Sanjoy Dasgupta. His postdoctoral stints were with Sham Kakade (at Penn) and Tong Zhang (at Rutgers).
- October 15th, 11:30 am Central: Seminar – Quanquan Gu (UCLA)
Join us for the Livestream of Quanquan Gu’s Talk
Title: Learning Wide Neural Networks: Polylogarithmic Overparameterization and A Mean Field Perspective
Abstract: A recent line of research in deep learning theory shows that the training of overparameterized deep neural networks can be characterized by a kernel function called the neural tangent kernel (NTK). However, existing results in the NTK regime are limited as they require: (i) an extremely wide neural network, which is impractical, and (ii) the network parameters to stay very close to initialization throughout training, which does not match empirical observation. In this talk, I will explain how these limitations in current NTK-based analyses can be alleviated. In the first part of this talk, I will show that under certain assumptions, we can prove optimization and generalization guarantees for DNNs with network width polylogarithmic in the training sample size and inverse target test error. In the second part of this talk, I will introduce a mean-field analysis in a generalized neural tangent kernel regime, and show that noisy gradient descent with weight decay can still exhibit a “kernel-like” behavior. Our analysis allows the network parameters trained by noisy gradient descent to be far away from initialization.
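As a concrete companion to the abstract, the empirical NTK of a small network can be computed directly as an inner product of parameter gradients. The sketch below (ours, not from the talk; differentiating only the hidden-layer weights for brevity) does this for a two-layer ReLU network at random initialization.

```python
import numpy as np

# Empirical neural tangent kernel K(x, x') = <grad_f(x), grad_f(x')>
# of a two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x),
# with gradients taken with respect to the hidden weights W only.
rng = np.random.default_rng(1)
d, m = 3, 1000                              # input dimension, hidden width
W = rng.normal(size=(m, d)) / np.sqrt(d)    # hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed output weights

def grad_f(x):
    pre = W @ x                             # pre-activations, (m,)
    act = (pre > 0).astype(float)           # ReLU derivative
    return (a * act)[:, None] * x[None, :]  # d f / d W, shape (m, d)

def ntk(x1, x2):
    return float(np.sum(grad_f(x1) * grad_f(x2)))

x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(ntk(x1, x1) > 0)                      # positive on the diagonal
print(abs(ntk(x1, x2) - ntk(x2, x1)) < 1e-9)  # and symmetric
```

In the NTK regime the talk describes, this kernel stays essentially fixed during training as the width m grows; the talk's point is precisely about what happens beyond that regime.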
Bio: Quanquan Gu is an Assistant Professor of Computer Science at UCLA. His current research is in the area of artificial intelligence and machine learning, with a focus on developing and analyzing nonconvex optimization algorithms for machine learning and building the theoretical foundations of deep learning. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Yahoo! Academic Career Enhancement Award, NSF CAREER Award, Simons-Berkeley Research Fellowship, Adobe Data Science Research Award, Salesforce Deep Learning Research Award, and AWS Machine Learning Research Award.
- October 29th, 11:30 am Central: Seminar – Francis Bach (INRIA)
Title: “On the Convergence of Gradient Descent for Wide Two-Layer Neural Networks”
Abstract: Many supervised learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many guarantees exist. Models which are nonlinear in their parameters such as neural networks lead to nonconvex optimization problems for which guarantees are harder to obtain. In this talk, I will consider twolayer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived. I will also highlight open problems related to the quantitative behavior of gradient descent for such models. (Joint work with Lénaïc Chizat)
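For readers who want to experiment, here is a minimal instance of the setting the abstract describes (a sketch of ours, not the talk's construction): full-batch gradient descent on a wide two-layer ReLU network for a tiny one-dimensional regression problem.

```python
import numpy as np

# Gradient descent on a wide two-layer ReLU network
# f(x) = sum_j a_j * relu(w_j . x), trained on 1-D regression data.
rng = np.random.default_rng(2)
n, m = 20, 500                       # samples, hidden width (m >> n)
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x).ravel()            # target values

W = rng.normal(size=(m, 1))          # hidden-layer weights
a = rng.normal(size=m) / np.sqrt(m)  # output-layer weights

def mse(W, a):
    return np.mean((np.maximum(x @ W.T, 0) @ a - y) ** 2)

loss0, lr = mse(W, a), 1e-3
for _ in range(500):
    z = x @ W.T                      # pre-activations, (n, m)
    h = np.maximum(z, 0)             # ReLU activations
    err = h @ a - y                  # residuals, (n,)
    dpred = 2 * err / n              # d(loss)/d(prediction)
    grad_a = h.T @ dpred
    grad_W = ((dpred[:, None] * a[None, :]) * (z > 0)).T @ x
    a -= lr * grad_a
    W -= lr * grad_W
print(mse(W, a) < loss0)             # training loss has decreased
```

The talk concerns the infinite-width limit of exactly this kind of dynamics for homogeneous activations; a finite simulation like this only illustrates the non-asymptotic behavior.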
Bio: Francis Bach is a researcher at Inria, where he has led the machine learning team, part of the Computer Science department at École Normale Supérieure, since 2011. He graduated from École Polytechnique in 1997 and completed his Ph.D. in Computer Science at U.C. Berkeley in 2005, working with Professor Michael Jordan. He spent two years in the Mathematical Morphology group at École des Mines de Paris, then joined the computer vision project-team at Inria/École Normale Supérieure from 2007 to 2010. Francis Bach is primarily interested in machine learning, and especially in sparse methods, kernel-based learning, large-scale optimization, computer vision, and signal processing. He obtained a Starting Grant in 2009 and a Consolidator Grant in 2016 from the European Research Council, and received the Inria young researcher prize in 2012, the ICML test-of-time award in 2014, the Lagrange prize in continuous optimization in 2018, and the Jean-Jacques Moreau prize in 2019. He was elected to the French Academy of Sciences in 2020. In 2015 he was program co-chair of the International Conference on Machine Learning (ICML), and general chair in 2018; he is now co-editor-in-chief of the Journal of Machine Learning Research.
- November 5th, 11:30 am Central: Seminar – Matus Telgarsky (University of Illinois, Urbana-Champaign)
Title and abstract TBA
- November 10th, 4:00 pm Central: Seminar – Surbhi Goel (MSR NYC)
Title and abstract TBA
- November 12th, 11:30 am Central: Seminar – Emmanuel Abbe (EPFL)
Title and abstract TBA
- November 19th, 11:30 am Central: Seminar – Rayadurgam Srikant (University of Illinois, Urbana-Champaign)
Title and abstract TBA
- December 1st, 4:00 pm Central: Seminar – Edgar Dobriban (University of Pennsylvania)
Title and abstract TBA
- December 3rd, 11:30 am Central: Seminar – Andrea Montanari (Stanford University)
Title and abstract TBA
Past Events
- September 15th, 4:00 pm Central: Kickoff Event
This Special Quarter is sponsored by The Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), a multi-discipline, multi-institution collaborative institute that focuses on key aspects of the theoretical foundations of data science. This is the second installment after a successful Special Quarter in spring 2020 on Inference and Data Science on Networks. An exciting program has been planned for the quarter, including four courses, a seminar series, and virtual social events – all free of charge! By organizing these group activities, we the organizers hope to create an environment for all participants including speakers and instructors to learn from each other, and also to catalyze research collaboration in the focus area of this Special Quarter.
The kickoff event for this quarter will be held on Tuesday September 15, 2020 at 4 pm Chicago/Central time. We will briefly introduce the institute, the key personnel, the quarter-long courses, and other programs. We will also take you on a tour of our virtual institute on http://gather.town – an amazing virtual space where you can “walk” around, meet other participants to video chat, and even work together. Please join us at the kickoff event and mingle!
- October 1st, 11:30 am Central: Seminar – Babak Hassibi (California Institute of Technology)
Watch the Recording of Babak Hassibi’s Talk
Title: “The Blind Men and the Elephant: The Mysteries of Deep Learning”
Abstract: Deep learning has demonstrably enjoyed a great deal of recent practical success and is arguably the main driver behind the resurgent interest in machine learning and AI. Despite its tremendous empirical achievements, we are far from a theoretical understanding of deep networks. In this talk, we will argue that the success of deep learning is due not only to the special deep architecture of the models, but also to the behavior of the stochastic descent methods used, which play a key role in reaching “good” solutions that generalize well to unseen data. We will connect learning algorithms such as stochastic gradient descent (SGD) and stochastic mirror descent (SMD) to work on H-infinity control from the 1990s, and thereby explain the convergence and implicit-regularization behavior of these algorithms when the model is highly overparametrized (what is now being called the “interpolating regime”). This gives us insight into why deep networks exhibit such powerful generalization abilities, a phenomenon now being referred to as “the blessing of dimensionality”.
- October 6th, 4:00 pm Central: Seminar – Julia Gaudio (MIT)
Watch the Recording of Julia Gaudio’s Talk
Title: “Regression Under Sparsity”
Abstract: In high-dimensional statistics, where the number of data features may be much greater than the number of data points, estimation is a difficult task both computationally and statistically. One way to make high-dimensional models more tractable is to enforce sparsity. In this talk, I will present recent work with David Gamarnik, in which we introduced the sparse isotonic regression model. Isotonic regression is the problem of estimating an unknown coordinatewise monotone function given noisy measurements. In the sparse version, only a small unknown subset of the features (“active coordinates”) determines the output. We provide an upper bound on the expected VC entropy of the space of sparse coordinatewise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. I will additionally discuss an extension to sparse monotone multi-index models.
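For background, the classical non-sparse, one-dimensional case of isotonic regression has an exact algorithm, pool adjacent violators, which underlies coordinatewise monotone estimation. A self-contained sketch (ours, not from the talk):

```python
# Pool-adjacent-violators: least-squares fit of a nondecreasing
# sequence to the observations y. Adjacent blocks that violate
# monotonicity are repeatedly merged and replaced by their
# weighted average.
def isotonic_fit(y):
    blocks = []  # list of [value, weight] pairs
    for v in y:
        blocks.append([float(v), 1.0])
        # Merge backwards while the sequence of block values decreases.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for v, w in blocks:
        out.extend([v] * int(w))
    return out

print(isotonic_fit([1, 3, 2, 4]))  # → [1.0, 2.5, 2.5, 4.0]
```

The sparse, multivariate setting of the talk is much harder: the active coordinates are unknown, so the monotone structure must be recovered jointly with the support.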
Bio: Julia Gaudio is an Applied Mathematics Instructor at the MIT Department of Mathematics, working with Elchanan Mossel. She obtained her PhD from the MIT Operations Research Center, advised by David Gamarnik and Patrick Jaillet. Her PhD was supported by a Microsoft Research PhD Fellowship. Prior to that, she studied applied mathematics (BS) and computer science (MS) at Brown University. Julia’s research is focused on high-dimensional probability and statistics. In recent work, she has studied settings with missing data and sparsity.
- October 8th, 11:30 am Central: Seminar – Jason Lee (Princeton University)
Watch the Recording of Jason Lee’s Talk
Title: “Beyond Linearization in Deep Learning: Hierarchical Learning and the Benefit of Representation”
Abstract: Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of these intermediate representations is not explained by recent theories that relate them to “shallow learners” such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexity compared with the raw input: for learning a low-rank degree-p polynomial (p ≥ 4) in d dimensions, neural representation requires only Õ(d^⌈p/2⌉) samples, while the best-known sample complexity upper bound for the raw input is Õ(d^(p−1)). We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite-width limit) when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
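The "fixed random network as representation" setup can be sketched in a few lines. The version below is a simplified stand-in of ours (a random ReLU feature map with a linear model trained on top, rather than the quadratic Taylor model analyzed in the talk):

```python
import numpy as np

# A fixed, randomly initialized ReLU layer used as a representation,
# with only a linear model trained on top by least squares.
rng = np.random.default_rng(4)
n, d, m = 200, 5, 300              # samples, input dim, feature width
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]              # a simple degree-2 target
W = rng.normal(size=(m, d))        # fixed, untrained first layer
Phi = np.maximum(X @ W.T, 0)       # neural representation, (n, m)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # trainable top layer
pred = Phi @ coef
print(np.mean((pred - y) ** 2))    # training error of the fitted model
```

The talk's results concern test-time sample complexity, which a training-error demonstration like this does not capture; it only illustrates the architecture being analyzed.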
Bio: Jason Lee is an assistant professor in Electrical Engineering and Computer Science (courtesy) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University, advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, nonconvex optimization algorithms, and reinforcement learning. He received a Sloan Research Fellowship in 2019 and a NIPS Best Student Paper Award, and was a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization.