Robust A-optimal Subsampling for Massive Data Robust Linear Regression

Tang, Ziting

doi:10.25394/PGS.11317724.v1

Dissertation_ZitingTang.pdf (1.29 MB)

thesis

posted on 2019-12-05, 01:58 authored by Ziting Tang

This thesis is concerned with massive data analysis via robust A-optimally efficient non-uniform subsampling. Motivated by the facts that massive data often contain outliers and that uniform sampling is inefficient, we derive a number of sampling distributions by minimizing the sum of the component variances of the subsampling estimate; these sampling distributions are robust against outliers. Massive data pose two computational bottlenecks: the data exceed a computer’s storage space, and computation takes too long. Both bottlenecks can be addressed simultaneously by selecting a subsample as a surrogate for the full sample and completing the data analysis on it. We develop our theory in a typical setting for robust linear regression in which the estimating functions are not differentiable. For an arbitrary sampling distribution, we establish consistency of the subsampling estimate for both fixed and growing dimension (as high dimensionality is common in massive data), and we prove asymptotic normality for fixed dimension. We discuss the A-optimal scoring method for fast computing. We conduct large simulations to evaluate the numerical performance of our proposed A-optimal sampling distribution. Real data applications are also presented.
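
As a rough illustration of the workflow the abstract describes (draw a non-uniform subsample, then fit a robust regression on it with inverse-probability weights), the following Python sketch may help. It is not the thesis's method: the abstract does not state the sampling distributions in closed form, so the scores used here, |psi(residual)| times the norm of M^{-1} x_i, are an assumption modeled on the general idea of minimizing the trace of the estimate's variance. The Huber score with tuning constant 1.345, the known unit noise scale, the uniform pilot subsample, and the 10% uniform mixing are likewise illustrative choices, and all function and parameter names are hypothetical.

```python
import numpy as np

def huber_psi(res, c=1.345):
    """Huber score: identity near zero, clipped at +/- c (not differentiable at +/- c)."""
    return np.clip(res, -c, c)

def weighted_huber_fit(X, y, w, c=1.345, iters=50):
    """Weighted Huber regression by iteratively reweighted least squares.

    Assumes the noise scale is known and equal to 1 (a simplification).
    """
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    for _ in range(iters):
        res = y - X @ beta
        # IRLS weight psi(res)/res: 1 inside [-c, c], c/|res| outside
        a = np.where(np.abs(res) <= c, 1.0, c / np.maximum(np.abs(res), 1e-12))
        sw = np.sqrt(w * a)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def subsample_estimate(X, y, size, pilot_size, rng, mix=0.1):
    """Two-step subsampling estimate with assumed A-optimal-style scores."""
    n = X.shape[0]
    # Step 1: uniform pilot subsample for a rough robust estimate
    idx0 = rng.choice(n, size=pilot_size, replace=True)
    beta0 = weighted_huber_fit(X[idx0], y[idx0], np.ones(pilot_size))
    # Step 2: scores |psi(residual_i)| * ||M^{-1} x_i|| (illustrative assumption)
    M = X[idx0].T @ X[idx0] / pilot_size
    lever = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)
    scores = np.abs(huber_psi(y - X @ beta0)) * lever
    # Mix with the uniform distribution so no probability is (near) zero
    pi = (1.0 - mix) * scores / scores.sum() + mix / n
    # Step 3: sample with replacement, then fit with inverse-probability weights
    idx = rng.choice(n, size=size, replace=True, p=pi)
    beta = weighted_huber_fit(X[idx], y[idx], 1.0 / (n * pi[idx]))
    return beta, pi
```

Note that computing the scores still requires one pass over the full data; only the iterative fitting runs on the subsample. Mixing with the uniform distribution bounds the inverse-probability weights, which keeps the weighted fit from being dominated by a few rarely sampled points.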

History

Degree Type

Doctor of Philosophy

Department

Mathematics

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Fei Tan

Advisor/Supervisor/Committee co-chair

Hanxiang Peng

Additional Committee Member 2

Jyotirmoy Sarkar

Additional Committee Member 3

Honglang Wang

Additional Committee Member 4

Guang Lin

Keywords

Robust Linear Models, Outliers, A-optimality, subsampling, massive data, Statistics

Licence

CC BY 4.0
