Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data

Chakravorty, Aritra

doi:10.25394/PGS.7438955.v1

thesis.pdf (31.11 MB)

Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data

thesis

posted on 2019-01-16, 18:15 authored by Aritra ChakravortyAritra Chakravorty

In Divide & Recombine (D&R), data are divided into subsets, analytic methodsare applied to each subset independently, with no communication between processes;then the subset outputs for each method are recombined. For big data, this providesalmost all of the analytic tasking needed when data are analyzed. It also provideshigh computational performance because typically most of the computation is em-barrassingly parallel, the simplest parallel computation.

Another kind of tasking must address computational performance and numericaccuracy: the computing of functions of all of the data, or “statistics”. For data bigand small, it is often important to compute such statistics for all of the data, whichcan be summaries of the data, such as sample quantiles of continuous variables, orcan process the data into a form that helps analysis, such as dividing the data intorepresentative subsets. Development of computational methods to compute thesestatistics can be challenging.

D&R can be a very effective framework for computing statistics. To supportthis, we introduce the concept of embarrassingly parallel (EP) statistics, both weakand strong. The concept of EP statistics is not entirely new, but has had littledevelopment. The existing methodology is mainly sums of sums. For example, this isdone when computing the necessary statistics for least squares where sums of productsand cross productions are carried out on subsets then summed across subsets. Ourtreatment of EP statistics has taken the concept much further. The outcome is abilityto use EP statistics in conjunction with the use a Fourier series to approximate an optimization criteria. The series terms, which are strongly EP statistics, are summedacross subsets, and the result is optimized. These are EP-F computational methods.

We have so far developed two EP-F computational methods for two widely usedstatistic computations. EP-F-Quantile is for quantiles of big data, and EP-F-KDtreeis for KD-trees. Speed and accuracy of EPF-Quantile are compared with that of thewell-known binning method, which also can be formulated in terms of EP statistics. EPF-KDtree is the first parallel KD-tree computational method of which we areaware. EP and EPF computational methods have potentially many other applicationsto computing statistics.

History

Degree Type

Doctor of Philosophy

Department

Statistics

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

William S Cleveland

Additional Committee Member 2

Patrick J Wolfe

Additional Committee Member 3

Mark D Ward

Additional Committee Member 4

Ryan Hafen

Usage metrics

Keywords

Divide and Recombine Map-Reduce Parallel algorithms.KD-tree quantiles Statistics

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Additional Committee Member 2

Additional Committee Member 3

Additional Committee Member 4

Usage metrics

Categories

Keywords

Licence

Exports