A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS

Cheung, Chung Ching

doi:10.25394/PGS.8986571.v1

Purdue_University_Thesis_Chung_Ching_Cheung.pdf (1.61 MB)

thesis

posted on 2019-08-13, 16:56 authored by Chung Ching Cheung

A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach to analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for linear models. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give asymptotic representations of the bias of both estimators, and we show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal subsampling, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are investigated further, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
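As a rough illustration of the idea of choosing subsampling probabilities by minimizing a trace, the sketch below uses ordinary least squares, the simplest estimating equation. It is not the dissertation's algorithm. The function names, the pilot step, and the simulated data are all assumptions made for this example. For sampling with replacement and inverse-probability weights, the trace of the approximate variance of the subsample estimator is proportional to the sum over i of e_i^2 ||(X'X)^{-1} x_i||^2 / pi_i, and by the Cauchy-Schwarz inequality this is minimized by taking pi_i proportional to |e_i| ||(X'X)^{-1} x_i||. The residuals e_i are unknown in practice, so the sketch estimates them from a uniform pilot subsample.

```python
import numpy as np

def weighted_subsample_ols(X, y, probs, r, rng):
    """Draw r rows with replacement according to probs, then solve the
    inverse-probability-weighted least-squares normal equations."""
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    w = 1.0 / probs[idx]                 # inverse-probability weights
    Xw = X[idx] * w[:, None]             # each sampled row scaled by its weight
    # Solve (sum w_i x_i x_i') beta = sum w_i x_i y_i
    return np.linalg.solve(Xw.T @ X[idx], Xw.T @ y[idx])

def trace_minimizing_probs(X, y, pilot_beta):
    """Probabilities proportional to |e_i| * ||(X'X)^{-1} x_i||, with the
    residuals e_i taken from a pilot estimate (an assumption of this sketch)."""
    resid = np.abs(y - X @ pilot_beta)
    G_inv = np.linalg.inv(X.T @ X)
    # G_inv is symmetric, so row i of X @ G_inv is ((X'X)^{-1} x_i)'
    norms = np.linalg.norm(X @ G_inv, axis=1)
    scores = resid * norms
    return scores / scores.sum()

# Simulated data, for illustration only (not the gas sensor data).
rng = np.random.default_rng(0)
n, p, r = 20_000, 5, 1_000
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

uniform = np.full(n, 1.0 / n)
pilot = weighted_subsample_ols(X, y, uniform, r, rng)   # uniform pilot fit
probs = trace_minimizing_probs(X, y, pilot)

beta_unif = weighted_subsample_ols(X, y, uniform, r, rng)
beta_opt = weighted_subsample_ols(X, y, probs, r, rng)
print("uniform error:        ", np.linalg.norm(beta_unif - beta_true))
print("trace-minimizing error:", np.linalg.norm(beta_opt - beta_true))
```

A single run like this only shows the mechanics. Comparing the two schemes by MSE, as the abstract describes, would require repeating the subsampling many times and averaging the squared errors. The sketch also forms X'X on the full data for simplicity, which a practical big-data implementation would likely approximate.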

Degree Type

Doctor of Philosophy

Department

Mathematics

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Dr. Hanxiang Peng

Advisor/Supervisor/Committee co-chair

Dr. Leonid Rubchinsky

Additional Committee Member 2

Dr. Benzion Boukai

Additional Committee Member 3

Dr. Guang Lin

Additional Committee Member 4

Dr. Mohammad AL Hasan

Keywords

subsampling, general estimating equations, a-optimality, big data, High Dimensional Data, Statistics

Licence

CC BY 4.0
