Distributed Bootstrap for Massive Data
Modern massive data, with enormous sample size and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework.
First, we propose two generic distributed bootstrap methods, \texttt{k-grad} and \texttt{n+k-1-grad}, which apply multiplier bootstrap at the master node on the gradients communicated across nodes. Based on them, we develop a communication-efficient method of producing an $\ell_\infty$-norm confidence region using distributed data with dimensionality not exceeding the local sample size. Our theory establishes the communication efficiency by providing a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency and showing that $\tau_{\min}$ only increases logarithmically with the number of workers and the dimensionality. Our simulation studies validate our theory.
Then, we extend \texttt{k-grad} and \texttt{n+k-1-grad} to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.
History
Degree Type
- Doctor of Philosophy
Department
- Statistics
Campus location
- West Lafayette