IDEAL Workshop on Algorithms for Massive Data Sets

At the workshop, we will see many new and exciting results on Algorithms for Massive Data Sets. If you would like to attend this workshop, please register here.

Speakers

Sami Davies (Northwestern), Sepideh Mahabadi (Microsoft Research), Jelani Nelson (UC Berkeley), Elizaveta Rebrova (Princeton), Rachel Ward (UT Austin), David Woodruff (CMU)

Logistics

  • Date: Friday, May 13
  • Location: Northwestern University
  • Room: Mudd Library 3514
  • Streaming: Zoom (see above)

Schedule

All times are in the Central Time Zone (CDT; Chicago time).

  • 8:40-9:05am: Breakfast
  • 9:05am: Opening Remarks
  • 9:10am: Sami Davies, Characterizing Good Predictions for Learning-augmented Algorithms with Instance-robustness (video)
  • 10:10am: Sepideh Mahabadi, Non-Adaptive Adaptive Sampling in Turnstile Streams (video)
  • 11:10am: Coffee Break
  • 11:20am: Elizaveta Rebrova, Randomized projection methods for corrupted data (video)
  • 12:20pm: Lunch
  • 1:10pm: Jelani Nelson, Private Frequency Estimation via Projective Geometry (video)
  • 2:10pm: David Woodruff, Memory Bounds for the Experts Problem (video)
  • 3:10pm: Rachel Ward, Faster Johnson-Lindenstrauss embeddings via Kronecker products (video)

Titles and Abstracts
 

Title: Characterizing Good Predictions for Learning-augmented Algorithms with Instance-robustness
Speaker: Sami Davies (Northwestern)
Abstract: Often, theoretical worst-case performance bounds (e.g. approximation and competitive ratios, upper bounds on runtimes and sample complexities, etc.) are far worse than what one can achieve in practice. One reason this happens is that, in practice, instances have certain observable, underlying structure that an algorithm can exploit. For example, in an online scheduling problem, the jobs that arrive today might look very similar to the jobs that arrived yesterday. The exploding field of learning-augmented algorithms is concerned with designing algorithms that circumvent worst-case bounds by using learned predictions about the input. We need to develop strong theoretical foundations about predictions and their associated algorithms in order to keep the field well-founded. A prediction and its associated algorithm are instance-robust, a notion introduced by Lavastida, Moseley, Ravi, and Xu, if, for a fixed prediction, small perturbations in the problem instance lead to only reasonable degradation in the augmented algorithm’s performance. In this talk, I will (1) give an overview of learning-augmented algorithms, (2) explain why instance-robustness is a foundational requirement for some predictions and discuss how it can improve learning-augmented algorithms on large inputs, and (3) introduce how one can separate predictions for a problem based on the notion of instance-robustness.
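As a toy illustration (not taken from the talk), the classic learning-augmented ski-rental problem shows the flavor of instance-robustness: hold the prediction fixed and check how the competitive ratio degrades as the instance is slightly perturbed. All strategies and numbers below are hypothetical choices made for the example.

# Toy sketch of instance-robustness on ski rental: the prediction is held
# fixed while the instance (the true season length) is perturbed.

def alg_cost(buy_price, true_days, predicted_days):
    """Prediction-trusting strategy: buy on day 1 if the prediction says the
    season is long enough to justify buying; otherwise always rent."""
    if predicted_days >= buy_price:
        return buy_price              # bought immediately
    return true_days                  # rented every day

def opt_cost(buy_price, true_days):
    return min(buy_price, true_days)  # offline optimum in hindsight

buy_price, prediction = 100, 95       # the prediction stays fixed below
for true_days in (90, 95, 100, 105, 110):   # an instance and small perturbations of it
    ratio = alg_cost(buy_price, true_days, prediction) / opt_cost(buy_price, true_days)
    print(true_days, round(ratio, 2)) # the ratio degrades gradually, not abruptly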

 

Title: Non-Adaptive Adaptive Sampling in Turnstile Streams
Speaker: Sepideh Mahabadi (Microsoft Research)
Abstract: Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying n by d matrix A, where n >> d, with probabilities proportional to their distances to the subspace spanned by the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams using space poly(d, k, log n), where k is the number of adaptive sampling rounds to be performed. Our adaptive sampling procedure has a number of applications to various data summarization problems on turnstile streams that either improve the state of the art or have only previously been studied in the more relaxed row-arrival model. This includes column subset selection, subspace approximation, projective clustering, and volume maximization.
Joint work with Ilya Razenshteyn, David Woodruff, and Samson Zhou.
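For reference, the classical centralized adaptive sampling procedure that the abstract starts from can be sketched in a few lines. This offline version, with a naive recomputation of residuals, is only meant to make the sampling rule concrete; it is not the one-pass turnstile-stream algorithm of the talk.

# Minimal offline sketch of adaptive (residual-based) sampling.
import numpy as np

def adaptive_sample(A, k, rng=np.random.default_rng(0)):
    """Select k rows of A, each chosen with probability proportional to its
    squared distance to the span of the rows selected so far."""
    n, d = A.shape
    chosen = []
    residual = A.copy()
    for _ in range(k):
        dists = np.sum(residual ** 2, axis=1)   # squared distance of each row to the current span
        i = rng.choice(n, p=dists / dists.sum())
        chosen.append(i)
        # recompute residuals: project every row onto the orthogonal
        # complement of the span of the chosen rows
        Q, _ = np.linalg.qr(A[chosen].T)        # orthonormal basis of that span
        residual = A - (A @ Q) @ Q.T
    return chosen

A = np.random.default_rng(1).normal(size=(1000, 10))
print(adaptive_sample(A, k=3))                  # indices of three adaptively sampled rows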

 

Title: Randomized projection methods for corrupted data
Speaker: Elizaveta Rebrova (Princeton)
Abstract: There are very few problems that can match the least squares fitting problem in terms of ubiquity of applications. When the data is large, or arrives in a streaming fashion, randomized iterative methods, such as the Randomized Kaczmarz method, provide a truly efficient way to compute the least squares solution. However, if the data is corrupted with arbitrarily large sparse errors, the iterates can be diverted far from the solution by each corrupted equation they encounter. To address this issue, we recently proposed robust versions of the Randomized Kaczmarz and Stochastic Gradient Descent methods that manage to avoid harmful corrupted equations by exploring the space as they proceed with the iterates. I will also discuss how to use the information obtained during the exploration phase efficiently, and what structural characteristics of the data matrix are crucial for such methods.
Based on joint work with Deanna Needell, Jamie Haddock, and Will Swartworth.
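To give a rough sense of how corrupted equations can be avoided, the sketch below runs a Kaczmarz-style iteration that skips an equation whenever its residual is large compared to a quantile of residuals from a small random sample of rows. The threshold rule and all constants here are illustrative choices, not the exact methods or guarantees from the talk.

# Kaczmarz-style iteration that skips equations with suspiciously large residuals.
import numpy as np

def robust_kaczmarz(A, b, iters=5000, q=0.7, sample=50, rng=np.random.default_rng(0)):
    n, d = A.shape
    x = np.zeros(d)
    row_norms = np.sum(A ** 2, axis=1)
    for _ in range(iters):
        i = rng.integers(n)
        r_i = b[i] - A[i] @ x                          # residual of the chosen equation
        # estimate a "typical" residual size from a small random sample of rows
        s = rng.integers(n, size=sample)
        threshold = np.quantile(np.abs(b[s] - A[s] @ x), q)
        if abs(r_i) <= threshold:                      # only project onto plausibly clean equations
            x = x + (r_i / row_norms[i]) * A[i]        # standard Kaczmarz projection step
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 20))
x_true = rng.normal(size=20)
b = A @ x_true
b[rng.choice(2000, size=40, replace=False)] += 100.0   # sparse, arbitrarily large corruptions
print(np.linalg.norm(robust_kaczmarz(A, b) - x_true))  # small despite the corruptions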

 

Title: Private Frequency Estimation via Projective Geometry
Speaker: Jelani Nelson (UC Berkeley)
Abstract: We propose a new algorithm ProjectiveGeometryResponse (PGR) for locally differentially private (LDP) frequency estimation. For a universe size of k and with n users, our ε-LDP algorithm has communication cost ⌈log₂ k⌉ bits in the private coin setting and ε log₂ e + O(1) bits in the public coin setting, and has computation cost O(n + k exp(ε) log k) for the server to approximately reconstruct the frequency histogram, while achieving the optimal privacy/utility tradeoff, including optimality of the leading constant factor. Our empirical evaluation shows a speedup of over 50x over PI-RAPPOR (Feldman and Talwar, 2021), while using approximately 75x less memory for practically relevant parameter settings. In addition, the running time of our algorithm is within an order of magnitude of HadamardResponse (Acharya, Sun, and Zhang, 2019) and RecursiveHadamardResponse (Chen, Kairouz, and Ozgur, 2020), which have significantly worse reconstruction error. Our new algorithm is based on using projective planes over a finite field to define a small collection of sets that are close to being pairwise independent, and a dynamic programming algorithm for approximate histogram reconstruction on the server side. We also give an extension of PGR, which we call HybridProjectiveGeometryResponse, that allows trading off computation time with utility smoothly.

Joint work with Vitaly Feldman (Apple), Huy Le Nguyen (Northeastern), and Kunal Talwar (Apple).
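To make the LDP frequency-estimation setting concrete, here is a bare-bones k-ary randomized response estimator with the standard debiasing step. This is a textbook baseline, not the ProjectiveGeometryResponse algorithm of the talk; it only illustrates the problem and the role of ε.

# k-ary randomized response: each user privatizes their value locally,
# and the server debiases the aggregate histogram.
import numpy as np

def randomize(v, k, eps, rng):
    """Report the true value v with probability e^eps / (e^eps + k - 1);
    otherwise report a uniformly random other value."""
    if rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return v
    other = rng.integers(k - 1)
    return other if other < v else other + 1

def estimate_frequencies(reports, k, eps):
    """Debias the empirical histogram of the randomized reports."""
    n = len(reports)
    p = np.exp(eps) / (np.exp(eps) + k - 1)   # P[report = v | true value = v]
    q = 1.0 / (np.exp(eps) + k - 1)           # P[report = u | true value = v], u != v
    counts = np.bincount(reports, minlength=k)
    return (counts - n * q) / (p - q)         # unbiased estimate of the true histogram

rng = np.random.default_rng(0)
k, eps, n = 16, 1.0, 100_000
data = rng.integers(k, size=n)
reports = np.array([randomize(v, k, eps, rng) for v in data])
print(np.round(estimate_frequencies(reports, k, eps))[:5])
print(np.bincount(data, minlength=k)[:5])     # compare against the true counts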

 

Title: Memory Bounds for the Experts Problem
Speaker: David Woodruff (CMU)
Abstract: Online learning with expert advice is a fundamental problem of sequential prediction. In this problem, the algorithm has access to a set of n “experts” who make predictions on each day. The goal on each day is to process these predictions, and make a prediction with the minimum cost. After making a prediction, the algorithm sees the actual outcome on that day, updates its state, and then moves on to the next day. An algorithm is judged by how well it does compared to the best expert in the set.

The classical algorithm for this problem is the multiplicative weights algorithm. Variations of this algorithm have been applied to and optimized for a broad range of problems, including boosting an ensemble of weak-learners in machine learning, and approximately solving linear and semi-definite programs. However, every application, to our knowledge, relies on storing weights for every expert, and uses Omega(n) memory. There is little work on understanding the memory required to solve the online learning with expert advice problem (or run standard sequential prediction algorithms, such as multiplicative weights) in natural streaming models, which is important when the number of experts, as well as the number of days on which the experts make predictions, is large.
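For reference, a minimal multiplicative weights implementation looks as follows; note that it keeps one weight per expert and therefore uses Omega(n) memory, which is exactly the baseline the abstract contrasts with (a sketch only, with losses assumed to lie in [0, 1]).

# Standard multiplicative weights for prediction with expert advice.
import numpy as np

def multiplicative_weights(loss_matrix, eta=0.1):
    """loss_matrix[t, i] is the loss of expert i on day t, assumed in [0, 1]."""
    T, n = loss_matrix.shape
    w = np.ones(n)                            # one weight per expert: Omega(n) memory
    alg_loss = 0.0
    for t in range(T):
        p = w / w.sum()                       # follow the experts in proportion to their weights
        alg_loss += p @ loss_matrix[t]
        w *= np.exp(-eta * loss_matrix[t])    # downweight experts that did badly today
    best_expert_loss = loss_matrix.sum(axis=0).min()
    return alg_loss, best_expert_loss         # regret is the difference of these two

losses = np.random.default_rng(0).random((1000, 50))
print(multiplicative_weights(losses))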

We initiate the study of the learning with expert advice problem in the streaming setting, and show lower and upper bounds. Our lower bound for i.i.d., random order, and adversarial order streams uses a novel masking technique and distributional detection game to show a smooth trade-off for regret versus memory. Our upper bounds in adversarial and random-order streams show ways to run standard sequential prediction algorithms in rounds on small “pools” of experts, thus reducing the necessary memory. For random-order streams, we show that our upper bound is tight up to low order terms.

Joint work with Vaidehi Srinivas, Ziyu (Neil) Xu, and Samson Zhou.

 

Title: Faster Johnson-Lindenstrauss embeddings via Kronecker products
Speaker: Rachel Ward (UT Austin)
Abstract: The Kronecker product is an important matrix operation with a wide range of applications in signal processing, graph theory, quantum computing and deep learning. In this work, we introduce a generalization of the fast Johnson–Lindenstrauss projection for embedding vectors with Kronecker product structure, the Kronecker fast Johnson–Lindenstrauss transform (KFJLT). When applied to vectors with Kronecker structure, the KFJLT reduces the embedding cost of the standard fast Johnson–Lindenstrauss transform by a factor exponential in the degree of the Kronecker product, by avoiding explicitly forming the full Kronecker products. We discuss how this computational gain comes with a quantifiable price in embedding power that scales logarithmically with the degree of the Kronecker product, and provide a sketch of the proof of this result.
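The computational saving can be seen from the mixed-product property (S1 ⊗ S2)(x1 ⊗ x2) = (S1 x1) ⊗ (S2 x2): each Kronecker factor can be sketched separately, so the full Kronecker vector is never formed. The dense Gaussian maps in the sketch below are only meant to illustrate this structure; the actual KFJLT uses fast randomized transforms on each factor together with subsampling.

# Sketching a Kronecker-structured vector without ever forming it explicitly.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m1, m2 = 100, 120, 10, 10

x1 = rng.normal(size=d1)
x2 = rng.normal(size=d2)
S1 = rng.normal(size=(m1, d1)) / np.sqrt(m1)          # sketch for the first factor
S2 = rng.normal(size=(m2, d2)) / np.sqrt(m2)          # sketch for the second factor

# Sketch the factors separately: roughly m1*d1 + m2*d2 operations ...
y_fast = np.kron(S1 @ x1, S2 @ x2)

# ... versus sketching the explicitly formed Kronecker vector: roughly m1*m2*d1*d2 operations.
x = np.kron(x1, x2)
y_slow = np.kron(S1, S2) @ x

print(np.allclose(y_fast, y_slow))                    # the two routes agree exactly
print(np.linalg.norm(y_fast) / np.linalg.norm(x))     # the norm is approximately preserved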

About the Series

The IDEAL workshop series brings in experts on topics related to the foundations of data science to present their perspective and research on a common theme. The workshops are aimed at Chicago-area researchers with an interest in the foundations of data science.

This workshop is part of the Special Quarter on High-Dimensional Data Analysis. It is organized by Konstantin Makarychev (NU) and Yury Makarychev (TTIC).