Fall 2020

Theory of Deep Learning

September 15 – December 12, 2020


Deep learning plays a central role in the recent revolution of artificial intelligence and data science. In a wide range of applications, such as computer vision, natural language processing, and robotics, deep learning achieves dramatic performance improvements over existing baselines and even over humans. Despite this empirical success, the theoretical foundation of deep learning remains poorly understood, which hinders the development of more principled methods with performance guarantees. In particular, the lack of performance guarantees makes it challenging to deploy deep learning in applications that involve decision making with critical consequences, such as healthcare and autonomous driving.

Toward a theoretical understanding of deep learning, many basic questions still lack satisfying answers:

  1. The objective function for training a neural network is highly non-convex. From an optimization perspective, why does stochastic gradient descent often converge to a desired solution in practice?
  2. The number of parameters of a neural network generally far exceeds the number of training data points (also known as over-parametrization). From a statistical perspective, why can the learned neural network generalize to testing data points, even though classical ML theory suggests serious overfitting?
  3. From an information-theoretic perspective, how can we characterize the form and/or the amount of information that each hidden layer has about the input and output of a deep neural network?
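Question 2 is easy to make concrete: even a modest fully connected network dwarfs a typical training set in parameter count. A minimal sketch (the layer sizes and dataset size below are hypothetical, chosen only for illustration):

```python
def mlp_param_count(d_in, widths, d_out):
    """Weights plus biases of a fully connected network."""
    dims = [d_in] + list(widths) + [d_out]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

# Hypothetical sizes: an MNIST-scale input, two hidden layers, 10 classes.
n_train = 10_000
params = mlp_param_count(784, [1024, 1024], 10)
print(params)  # 1,863,690 parameters for only 10,000 training points
```

With roughly 186 parameters per training point, classical capacity-based generalization bounds are vacuous, which is precisely what makes question 2 interesting.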


Organizers: Nathan Srebro (TTIC), Zhaoran Wang (Northwestern University), Dongning Guo (Northwestern University)


The Special Quarter will be run on the gather.town platform, which will support a virtual space for classes, meetings, seminars, and water-cooler interactions. Registration will give you access to the virtual institute, and allow you to participate in these activities.

Graduate Courses

The following graduate courses are coordinated to facilitate attendance by participants of the special quarter. These courses will be offered remotely to students attending the program. Any questions can be directed to the instructors.

Upcoming Events

  • October 13th, 4:00 pm Central: Seminar – Daniel Hsu (Columbia University)
    Join us for the Livestream of Daniel Hsu’s Talk
    Title: “Contrastive learning, multi-view redundancy, and linear models”
    Abstract: Contrastive learning is a “self-supervised” approach to representation learning that uses naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. We study contrastive learning in the context of multi-view statistical models. First, we show that whenever the views of the data are approximately redundant in their ability to predict a target function, a low-dimensional embedding obtained via contrastive learning affords a linear predictor with near-optimal predictive accuracy. Second, we show that in the context of topic models, the embedding can be interpreted as a linear transformation of the posterior moments of the hidden topic distribution given the observed words. We also empirically demonstrate that linear classifiers with these representations perform well in document classification tasks with very few labeled examples in a semi-supervised setting.
    This is joint work with Akshay Krishnamurthy (MSR) and Christopher Tosh (Columbia).
    Bio: Daniel Hsu is an associate professor in the Department of Computer Science and a member of the Data Science Institute, both at Columbia University. Previously, he was a postdoc at Microsoft Research New England and in the Departments of Statistics at Rutgers University and the University of Pennsylvania. He holds a Ph.D. in Computer Science from UC San Diego, and a B.S. in Computer Science and Engineering from UC Berkeley. He was selected by IEEE Intelligent Systems as one of “AI’s 10 to Watch” in 2015 and received a 2016 Sloan Research Fellowship.
    Daniel’s research interests are in algorithmic statistics and machine learning. His work has produced the first computationally efficient algorithms for several statistical estimation tasks (including many involving latent variable models such as mixture models, hidden Markov models, and topic models), provided new algorithmic frameworks for solving interactive machine learning problems, and led to the creation of scalable tools for machine learning applications.
    His Ph.D. advisor at UCSD was Sanjoy Dasgupta. His postdoctoral stints were with Sham Kakade (at Penn) and Tong Zhang (at Rutgers).
  • October 15th, 11:30 am Central: Seminar – Quanquan Gu (UCLA)
    Join us for the Livestream of Quanquan Gu’s Talk
    Title: “Learning Wide Neural Networks: Polylogarithmic Over-parameterization and A Mean Field Perspective”
    Abstract: A recent line of research in deep learning theory shows that the training of over-parameterized deep neural networks can be characterized by a kernel function called neural tangent kernel (NTK). However, existing results in the NTK regime are limited as they require: (i) an extremely wide neural network, which is impractical, and (ii) the network parameters to stay very close to initialization throughout training, which does not match empirical observation. In this talk, I will explain how these limitations in the current NTK-based analyses can be alleviated. In the first part of this talk, I will show that under certain assumptions, we can prove optimization and generalization guarantees for DNNs with network width polylogarithmic in the training sample size and inverse target test error. In the second part of this talk, I will introduce a mean-field analysis in a generalized neural tangent kernel regime, and show that noisy gradient descent with weight decay can still exhibit a “kernel-like” behavior. Our analysis allows the network parameters trained by noisy gradient descent to be far away from initialization.
    Bio: Quanquan Gu is an Assistant Professor of Computer Science at UCLA. His current research is in the area of artificial intelligence and machine learning, with a focus on developing and analyzing nonconvex optimization algorithms for machine learning and building the theoretical foundations of deep learning. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Yahoo! Academic Career Enhancement Award, NSF CAREER Award, Simons Berkeley Research Fellowship, Adobe Data Science Research Award, Salesforce Deep Learning Research Award and AWS Machine Learning Research Award.
  • October 29th, 11:30 am Central: Seminar – Francis Bach (INRIA)
    Title: “On the Convergence of Gradient Descent for Wide Two-Layer Neural Networks”
    Abstract: Many supervised learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many guarantees exist. Models which are non-linear in their parameters such as neural networks lead to non-convex optimization problems for which guarantees are harder to obtain. In this talk, I will consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived. I will also highlight open problems related to the quantitative behavior of gradient descent for such models. (Joint work with Lénaïc Chizat)
    Bio: Francis Bach is a researcher at Inria, where since 2011 he has led the machine learning team that is part of the Computer Science department at Ecole Normale Supérieure. He graduated from Ecole Polytechnique in 1997 and completed his Ph.D. in Computer Science at U.C. Berkeley in 2005, working with Professor Michael Jordan. He spent two years in the Mathematical Morphology group at Ecole des Mines de Paris, then joined the computer vision project-team at Inria/Ecole Normale Supérieure from 2007 to 2010. Francis Bach is primarily interested in machine learning, especially sparse methods, kernel-based learning, large-scale optimization, computer vision, and signal processing. He obtained a Starting Grant in 2009 and a Consolidator Grant in 2016 from the European Research Council, and received the Inria young researcher prize in 2012, the ICML test-of-time award in 2014, the Lagrange prize in continuous optimization in 2018, and the Jean-Jacques Moreau prize in 2019. He was elected to the French Academy of Sciences in 2020. In 2015, he was program co-chair of the International Conference on Machine Learning (ICML), and general chair in 2018; he is now co-editor-in-chief of the Journal of Machine Learning Research.
  • November 5th, 11:30 am Central: Seminar – Matus Telgarsky (University of Illinois, Urbana-Champaign)
    Title and abstract TBA
  • November 10th, 4:00 pm Central: Seminar – Surbhi Goel (MSR NYC)
    Title and abstract TBA
  • November 12th, 11:30 am Central: Seminar – Emmanuel Abbe (EPFL)
    Title and abstract TBA
  • November 19th, 11:30 am Central: Seminar – Rayadurgam Srikant (University of Illinois, Urbana-Champaign)
    Title and abstract TBA
  • December 1st, 4:00 pm Central: Seminar – Edgar Dobriban (University of Pennsylvania)
    Title and abstract TBA
  • December 3rd, 11:30 am Central: Seminar – Andrea Montanari (Stanford University)
    Title and abstract TBA
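Several of the talks above (Gu, Bach) concern wide two-layer networks trained by gradient descent, where in the “lazy” or NTK-style regime the loss can drop to near zero while the weights stay close to initialization. The sketch below illustrates this on hypothetical toy data; the network width, data, learning rate, and step count are all made up for illustration and are not taken from any of the talks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 points of a 1-D regression problem.
n, width = 20, 2000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.stack([x, np.ones(n)], axis=1)   # constant second feature acts as a bias
y = np.sin(3.0 * x)

# Two-layer ReLU network f(x) = a . relu(W x) / sqrt(width), NTK-style scaling.
W = rng.normal(size=(width, 2))
a = rng.normal(size=width)
W0 = W.copy()

def forward(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(len(a))

lr = 0.2
mse0 = np.mean((forward(X, W, a) - y) ** 2)
for _ in range(1000):
    pre = X @ W.T                         # pre-activations, shape (n, width)
    h = np.maximum(pre, 0.0)
    err = h @ a / np.sqrt(width) - y      # residuals of the squared loss
    grad_a = h.T @ err / (np.sqrt(width) * n)
    grad_W = ((err[:, None] * (pre > 0)) * a).T @ X / (np.sqrt(width) * n)
    a -= lr * grad_a
    W -= lr * grad_W

mse = np.mean((forward(X, W, a) - y) ** 2)
drift = np.linalg.norm(W - W0) / np.linalg.norm(W0)
print(mse0, mse, drift)  # training loss drops while the hidden weights barely move
```

The small relative drift of `W` from `W0` is the empirical signature of the near-initialization regime that the NTK analyses assume, and it is exactly the assumption that Gu's mean-field analysis and Bach's infinite-width limit aim to relax.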

Past Events

  • September 15th, 4:00 pm Central: Kickoff Event
    This Special Quarter is sponsored by The Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), a multi-discipline, multi-institution collaborative institute that focuses on key aspects of the theoretical foundations of data science. This is the second installment after a successful Special Quarter in spring 2020 on Inference and Data Science on Networks. An exciting program has been planned for the quarter, including four courses, a seminar series, and virtual social events – all free of charge! By organizing these group activities, we, the organizers, hope to create an environment in which all participants, including speakers and instructors, can learn from each other, and to catalyze research collaboration in the focus area of this Special Quarter.
    The kick-off event for this quarter will be held on Tuesday September 15, 2020 at 4 pm Chicago/Central time. We will briefly introduce the institute, the key personnel, the quarter-long courses, and other programs. We will also take you on a tour of our virtual institute on http://gather.town – an amazing virtual space where you can “walk” around, meet other participants, video chat, and even work together. Please join us at the kick-off event and mingle!
  • October 1st, 11:30 am Central: Seminar – Babak Hassibi (California Institute of Technology)
    Watch the Recording of Babak Hassibi’s Talk
    Title: “The Blind Men and the Elephant: The Mysteries of Deep Learning”
    Abstract: Deep learning has demonstrably enjoyed a great deal of recent practical success and is arguably the main driver behind the resurgent interest in machine learning and AI. Despite its tremendous empirical achievements, we are far from a theoretical understanding of deep networks. In this talk, we will argue that the success of deep learning is not only due to the special deep architecture of the models, but also due to the behavior of the stochastic descent methods used, which play a key role in reaching “good” solutions that generalize well to unseen data. We will connect learning algorithms such as stochastic gradient descent (SGD) and stochastic mirror descent (SMD) to work in H-infinity control from the 1990s, and thereby explain the convergence and implicit-regularization behavior of the algorithms when we are highly over-parametrized (what is now being called the “interpolating regime”). This gives us insight into why deep networks exhibit such powerful generalization abilities, a phenomenon now being referred to as “the blessing of dimensionality”.
  • October 6th, 4:00 pm Central: Seminar – Julia Gaudio (MIT)
    Watch the Recording of Julia Gaudio’s Talk
    Title: “Regression Under Sparsity”
    Abstract: In high-dimensional statistics, where the number of data features may be much greater than the number of data points, estimation is a difficult task both computationally and statistically. One way to make high-dimensional models more tractable is to enforce sparsity. In this talk, I will present recent work with David Gamarnik, in which we introduced the sparse isotonic regression model. Isotonic regression is the problem of estimating an unknown coordinate-wise monotone function given noisy measurements. In the sparse version, only a small unknown subset of the features (“active coordinates”) determines the output. We provide an upper bound on the expected VC entropy of the space of sparse coordinate-wise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. I will additionally discuss an extension to sparse monotone multi-index models.
    Bio: Julia Gaudio is an Applied Mathematics Instructor at the MIT Department of Mathematics, working with Elchanan Mossel. She obtained her PhD from the MIT Operations Research Center, advised by David Gamarnik and Patrick Jaillet. Her PhD was supported by a Microsoft Research PhD Fellowship. Prior to that, she studied applied mathematics (BS) and computer science (MS) at Brown University. Julia’s research is focused on high-dimensional probability and statistics. In recent work, she has studied settings with missing data and sparsity.
  • October 8th, 11:30 am Central: Seminar – Jason Lee (Princeton University)
    Watch the Recording of Jason Lee’s Talk
    Title: “Beyond Linearization in Deep Learning: Hierarchical Learning and the Benefit of Representation”
    Abstract: Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations is not explained by recent theories that relate them to “shallow learners” such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-p polynomial (p ≥ 4) in d dimensions, neural representation requires only Õ(d^⌈p/2⌉) samples, while the best-known sample-complexity upper bound for the raw input is Õ(d^(p−1)). We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite-width limit) when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
    Bio: Jason Lee is an assistant professor in Electrical Engineering and Computer Science (courtesy) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, non-convex optimization algorithms, and reinforcement learning. He received a Sloan Research Fellowship in 2019 and the NIPS Best Student Paper Award, and was a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization.
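Hassibi’s abstract above attributes part of deep learning’s success to the implicit regularization of stochastic descent methods in the interpolating regime. That phenomenon is already visible in over-parameterized linear regression: SGD from zero initialization converges to the minimum Euclidean-norm interpolating solution. A minimal sketch (the problem sizes, learning rate, and step count below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-determined linear regression: more parameters (d) than data (n).
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# SGD on the squared loss, cycling through the data from zero initialization.
w = np.zeros(d)
lr = 0.01
for step in range(20_000):
    i = step % n
    w -= lr * (X[i] @ w - y[i]) * X[i]

# Minimum-norm interpolator: every SGD update lies in the row space of X,
# so the limit of SGD coincides with this closed-form solution.
w_star = X.T @ np.linalg.solve(X @ X.T, y)

print(np.max(np.abs(X @ w - y)))   # close to 0: w interpolates the data
print(np.linalg.norm(w - w_star))  # close to 0: SGD found the min-norm solution
```

Among the infinitely many zero-training-error solutions, SGD silently picks a specific one; the H-infinity-control viewpoint in the talk extends this kind of characterization to SMD with general potentials and to nonlinear models.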

