Fall 2020

Theory of Deep Learning

September 15 – December 12, 2020


Deep learning plays a central role in the recent revolution of artificial intelligence and data science. In a wide range of applications, such as computer vision, natural language processing, and robotics, deep learning achieves dramatic performance improvements over existing baselines and even over humans. Despite this empirical success, the theoretical foundation of deep learning remains poorly understood, which hinders the development of more principled methods with performance guarantees. In particular, the lack of performance guarantees makes it challenging to deploy deep learning in applications that involve decision making with critical consequences, such as healthcare and autonomous driving.

Toward a theoretical understanding of deep learning, many basic questions still lack satisfying answers:

  1. The objective function for training a neural network is highly non-convex. From an optimization perspective, why does stochastic gradient descent often converge to a desired solution in practice?
  2. The number of parameters of a neural network generally far exceeds the number of training data points (also known as over-parametrization). From a statistical perspective, why can the learned neural network generalize to testing data points, even though classical ML theory suggests serious overfitting?
  3. From an information-theoretic perspective, how can we characterize the form and/or the amount of information that each hidden layer has about the input and output of a deep neural network?
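Question 2 is easy to make concrete: even a modest fully connected network dwarfs a typical training set in parameter count. A minimal sketch (the layer sizes and dataset size below are hypothetical, chosen only for illustration):

```python
def mlp_param_count(d_in, widths, d_out):
    """Weights plus biases of a fully connected network."""
    dims = [d_in] + list(widths) + [d_out]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

# Hypothetical sizes: an MNIST-scale input, two hidden layers, 10 classes.
n_train = 10_000
params = mlp_param_count(784, [1024, 1024], 10)
print(params)  # 1,863,690 parameters for only 10,000 training points
```

With roughly 186 parameters per training point, classical capacity-based generalization bounds are vacuous, which is precisely what makes question 2 interesting.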


Organizers: Nathan Srebro (TTIC), Zhaoran Wang (Northwestern University), Dongning Guo (Northwestern University)


The Special Quarter will be run on the gather.town platform, which will support a virtual space for classes, meetings, seminars, and water-cooler interactions. Registration will give you access to the virtual institute, and allow you to participate in these activities.

Graduate Courses

The following graduate courses are coordinated to facilitate attendance by participants of the special quarter. These courses will be offered remotely to students attending the program. Any questions can be directed to the instructors.

Upcoming Events

  • October 13th, 4:00 pm Central: Seminar – Daniel Hsu (Columbia University)
    Join us for the Livestream of Daniel Hsu’s Talk
    Title: “Contrastive learning, multi-view redundancy, and linear models”
    Abstract: Contrastive learning is a “self-supervised” approach to representation learning that uses naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. We study contrastive learning in the context of multi-view statistical models. First, we show that whenever the views of the data are approximately redundant in their ability to predict a target function, a low-dimensional embedding obtained via contrastive learning affords a linear predictor with near-optimal predictive accuracy. Second, we show that in the context of topic models, the embedding can be interpreted as a linear transformation of the posterior moments of the hidden topic distribution given the observed words. We also empirically demonstrate that linear classifiers with these representations perform well in document classification tasks with very few labeled examples in a semi-supervised setting.
    This is joint work with Akshay Krishnamurthy (MSR) and Christopher Tosh (Columbia).
    Bio: Daniel Hsu is an associate professor in the Department of Computer Science and a member of the Data Science Institute, both at Columbia University. Previously, he was a postdoc at Microsoft Research New England and in the Departments of Statistics at Rutgers University and the University of Pennsylvania. He holds a Ph.D. in Computer Science from UC San Diego, and a B.S. in Computer Science and Engineering from UC Berkeley. He was selected by IEEE Intelligent Systems as one of “AI’s 10 to Watch” in 2015 and received a 2016 Sloan Research Fellowship.
    Daniel’s research interests are in algorithmic statistics and machine learning. His work has produced the first computationally efficient algorithms for several statistical estimation tasks (including many involving latent variable models such as mixture models, hidden Markov models, and topic models), provided new algorithmic frameworks for solving interactive machine learning problems, and led to the creation of scalable tools for machine learning applications.
    His Ph.D. advisor at UCSD was Sanjoy Dasgupta. His postdoctoral stints were with Sham Kakade (at Penn) and Tong Zhang (at Rutgers).
  • October 15th, 11:30 am Central: Seminar – Quanquan Gu (UCLA)
    Join us for the Livestream of Quanquan Gu’s Talk
    Title: “Learning Wide Neural Networks: Polylogarithmic Over-parameterization and A Mean Field Perspective”
    Abstract: A recent line of research in deep learning theory shows that the training of over-parameterized deep neural networks can be characterized by a kernel function called neural tangent kernel (NTK). However, existing results in the NTK regime are limited as they require: (i) an extremely wide neural network, which is impractical, and (ii) the network parameters to stay very close to initialization throughout training, which does not match empirical observation. In this talk, I will explain how these limitations in the current NTK-based analyses can be alleviated. In the first part of this talk, I will show that under certain assumptions, we can prove optimization and generalization guarantees for DNNs with network width polylogarithmic in the training sample size and inverse target test error. In the second part of this talk, I will introduce a mean-field analysis in a generalized neural tangent kernel regime, and show that noisy gradient descent with weight decay can still exhibit a “kernel-like” behavior. Our analysis allows the network parameters trained by noisy gradient descent to be far away from initialization.
    Bio: Quanquan Gu is an Assistant Professor of Computer Science at UCLA. His current research is in the area of artificial intelligence and machine learning, with a focus on developing and analyzing nonconvex optimization algorithms for machine learning and building the theoretical foundations of deep learning. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Yahoo! Academic Career Enhancement Award, NSF CAREER Award, Simons Berkeley Research Fellowship, Adobe Data Science Research Award, Salesforce Deep Learning Research Award and AWS Machine Learning Research Award.
  • October 29th, 11:30 am Central: Seminar – Francis Bach (INRIA)
    Title: “On the Convergence of Gradient Descent for Wide Two-Layer Neural Networks”
    Abstract: Many supervised learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many guarantees exist. Models which are non-linear in their parameters such as neural networks lead to non-convex optimization problems for which guarantees are harder to obtain. In this talk, I will consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived. I will also highlight open problems related to the quantitative behavior of gradient descent for such models. (Joint work with Lénaïc Chizat)
    Bio: Francis Bach is a researcher at Inria, where since 2011 he has led the machine learning team that is part of the Computer Science department at Ecole Normale Supérieure. He graduated from Ecole Polytechnique in 1997 and completed his Ph.D. in Computer Science at U.C. Berkeley in 2005, working with Professor Michael Jordan. He spent two years in the Mathematical Morphology group at Ecole des Mines de Paris, then joined the computer vision project-team at Inria/Ecole Normale Supérieure from 2007 to 2010. Francis Bach is primarily interested in machine learning, especially sparse methods, kernel-based learning, large-scale optimization, computer vision, and signal processing. He obtained a Starting Grant in 2009 and a Consolidator Grant in 2016 from the European Research Council, and received the Inria young researcher prize in 2012, the ICML test-of-time award in 2014, the Lagrange prize in continuous optimization in 2018, and the Jean-Jacques Moreau prize in 2019. He was elected to the French Academy of Sciences in 2020. In 2015, he was program co-chair of the International Conference on Machine Learning (ICML), and general chair in 2018; he is now co-editor-in-chief of the Journal of Machine Learning Research.
  • November 5th, 11:30 am Central: Seminar – Matus Telgarsky (University of Illinois, Urbana-Champaign)
    Title and abstract TBA
  • November 10th, 4:00 pm Central: Seminar – Surbhi Goel (MSR NYC)
    Title and abstract TBA
  • November 12th, 11:30 am Central: Seminar – Emmanuel Abbe (EPFL)
    Title and abstract TBA
  • November 19th, 11:30 am Central: Seminar – Rayadurgam Srikant (University of Illinois, Urbana-Champaign)
    Title and abstract TBA
  • December 1st, 4:00 pm Central: Seminar – Edgar Dobriban (University of Pennsylvania)
    Title and abstract TBA
  • December 3rd, 11:30 am Central: Seminar – Andrea Montanari (Stanford University)
    Title and abstract TBA
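Several of the talks above (Gu, Bach) concern wide two-layer networks trained by gradient descent, where in the “lazy” or NTK-style regime the loss can drop to near zero while the weights stay close to initialization. The sketch below illustrates this on hypothetical toy data; the network width, data, learning rate, and step count are all made up for illustration and are not taken from any of the talks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 points of a 1-D regression problem.
n, width = 20, 2000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.stack([x, np.ones(n)], axis=1)   # constant second feature acts as a bias
y = np.sin(3.0 * x)

# Two-layer ReLU network f(x) = a . relu(W x) / sqrt(width), NTK-style scaling.
W = rng.normal(size=(width, 2))
a = rng.normal(size=width)
W0 = W.copy()

def forward(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(len(a))

lr = 0.2
mse0 = np.mean((forward(X, W, a) - y) ** 2)
for _ in range(1000):
    pre = X @ W.T                         # pre-activations, shape (n, width)
    h = np.maximum(pre, 0.0)
    err = h @ a / np.sqrt(width) - y      # residuals of the squared loss
    grad_a = h.T @ err / (np.sqrt(width) * n)
    grad_W = ((err[:, None] * (pre > 0)) * a).T @ X / (np.sqrt(width) * n)
    a -= lr * grad_a
    W -= lr * grad_W

mse = np.mean((forward(X, W, a) - y) ** 2)
drift = np.linalg.norm(W - W0) / np.linalg.norm(W0)
print(mse0, mse, drift)  # training loss drops while the hidden weights barely move
```

The small relative drift of `W` from `W0` is the empirical signature of the near-initialization regime that the NTK analyses assume, and it is exactly the assumption that Gu's mean-field analysis and Bach's infinite-width limit aim to relax.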

Past Events

  • September 15th, 4:00 pm Central: Kickoff Event
    This Special Quarter is sponsored by The Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), a multi-discipline, multi-institution collaborative institute that focuses on key aspects of the theoretical foundations of data science. This is the second installment after a successful Special Quarter in spring 2020 on Inference and Data Science on Networks. An exciting program has been planned for the quarter, including four courses, a seminar series, and virtual social events – all free of charge! By organizing these group activities, we, the organizers, hope to create an environment in which all participants, including speakers and instructors, can learn from each other, and to catalyze research collaboration in the focus area of this Special Quarter.
    The kick-off event for this quarter will be held on Tuesday September 15, 2020 at 4 pm Chicago/Central time. We will briefly introduce the institute, the key personnel, the quarter-long courses, and other programs. We will also take you on a tour of our virtual institute on http://gather.town – an amazing virtual space where you can “walk” around, meet other participants, video chat, and even work together. Please join us at the kick-off event and mingle!
  • October 1st, 11:30 am Central: Seminar – Babak Hassibi (California Institute of Technology)
    Watch the Recording of Babak Hassibi’s Talk
    Title: “The Blind Men and the Elephant: The Mysteries of Deep Learning”
    Abstract: Deep learning has demonstrably enjoyed a great deal of recent practical success and is arguably the main driver behind the resurgent interest in machine learning and AI. Despite its tremendous empirical achievements, we are far from a theoretical understanding of deep networks. In this talk, we will argue that the success of deep learning is not only due to the special deep architecture of the models, but also due to the behavior of the stochastic descent methods used, which play a key role in reaching “good” solutions that generalize well to unseen data. We will connect learning algorithms such as stochastic gradient descent (SGD) and stochastic mirror descent (SMD) to work in H-infinity control from the 1990s, and thereby explain the convergence and implicit-regularization behavior of the algorithms when we are highly over-parametrized (what is now being called the “interpolating regime”). This gives us insight into why deep networks exhibit such powerful generalization abilities, a phenomenon now being referred to as “the blessing of dimensionality”.
  • October 6th, 4:00 pm Central: Seminar – Julia Gaudio (MIT)
    Watch the Recording of Julia Gaudio’s Talk
    Title: “Regression Under Sparsity”
    Abstract: In high-dimensional statistics, where the number of data features may be much greater than the number of data points, estimation is a difficult task both computationally and statistically. One way to make high-dimensional models more tractable is to enforce sparsity. In this talk, I will present recent work with David Gamarnik, in which we introduced the sparse isotonic regression model. Isotonic regression is the problem of estimating an unknown coordinate-wise monotone function given noisy measurements. In the sparse version, only a small unknown subset of the features (“active coordinates”) determines the output. We provide an upper bound on the expected VC entropy of the space of sparse coordinate-wise monotone functions, and identify the regime of statistical consistency of our estimator. We also propose a linear program to recover the active coordinates, and provide theoretical recovery guarantees. I will additionally discuss an extension to sparse monotone multi-index models.
    Bio: Julia Gaudio is an Applied Mathematics Instructor at the MIT Department of Mathematics, working with Elchanan Mossel. She obtained her PhD from the MIT Operations Research Center, advised by David Gamarnik and Patrick Jaillet. Her PhD was supported by a Microsoft Research PhD Fellowship. Prior to that, she studied applied mathematics (BS) and computer science (MS) at Brown University. Julia’s research is focused on high-dimensional probability and statistics. In recent work, she has studied settings with missing data and sparsity.
  • October 8th, 11:30 am Central: Seminar – Jason Lee (Princeton University)
    Watch the Recording of Jason Lee’s Talk
    Title: “Beyond Linearization in Deep Learning: Hierarchical Learning and the Benefit of Representation”
    Abstract: Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations is not explained by recent theories that relate them to “shallow learners” such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-p polynomial (p ≥ 4) in d dimensions, neural representation requires only Õ(d^⌈p/2⌉) samples, while the best-known sample-complexity upper bound for the raw input is Õ(d^(p−1)). We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite-width limit) when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
    Bio: Jason Lee is an assistant professor in Electrical Engineering and Computer Science (courtesy) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, non-convex optimization algorithms, and reinforcement learning. He received a Sloan Research Fellowship in 2019 and the NIPS Best Student Paper Award, and was a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization.
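Hassibi’s abstract above attributes part of deep learning’s success to the implicit regularization of stochastic descent methods in the interpolating regime. That phenomenon is already visible in over-parameterized linear regression: SGD from zero initialization converges to the minimum Euclidean-norm interpolating solution. A minimal sketch (the problem sizes, learning rate, and step count below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-determined linear regression: more parameters (d) than data (n).
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# SGD on the squared loss, cycling through the data from zero initialization.
w = np.zeros(d)
lr = 0.01
for step in range(20_000):
    i = step % n
    w -= lr * (X[i] @ w - y[i]) * X[i]

# Minimum-norm interpolator: every SGD update lies in the row space of X,
# so the limit of SGD coincides with this closed-form solution.
w_star = X.T @ np.linalg.solve(X @ X.T, y)

print(np.max(np.abs(X @ w - y)))   # close to 0: w interpolates the data
print(np.linalg.norm(w - w_star))  # close to 0: SGD found the min-norm solution
```

Among the infinitely many zero-training-error solutions, SGD silently picks a specific one; the H-infinity-control viewpoint in the talk extends this kind of characterization to SMD with general potentials and to nonlinear models.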

