Clustering High-dimensional Noisy Categorical and Mixed Data

Tian, Zhiyi

doi:10.25394/PGS.15058008.v1

PurdueThesis_Zhiyi Tian.pdf (1.03 MB)

Clustering High-dimensional Noisy Categorical and Mixed Data

thesis

posted on 2021-07-27, 16:15 authored by Zhiyi TianZhiyi Tian

Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1-epsilon) >= CMr² log³M, with M=max(n,m) and a fixed constant C. Moreover, we show that mn(1- epsilon)² >= r *delta/2 with 0< delta <1 is necessary for any algorithm to succeed with probability at least (1+delta)/2. In case, where m=n and r is fixed, for example, the sufficient condition matches with the necessary condition up to a polylog(n) factor, showing that our proposed algorithm is nearly optimal. We also show our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency, both theoretically and numerically. In addition, we propose a spectral algorithm with standardization to cluster mixed data. This algorithm is computationally efficient and its clustering accuracy has been evaluated numerically on both real world data and synthetic data.

History

Degree Type

Doctor of Philosophy

Department

Management

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Jen Tang

Advisor/Supervisor/Committee co-chair

Jiaming Xu

Additional Committee Member 2

Yanjun Li

Additional Committee Member 3

Wei Sun

Usage metrics

Keywords

Clustering Categorical values Noisy and sparse data High dimensions Spectral algorithm Clustering accuracy Statistics

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Clustering High-dimensional Noisy Categorical and Mixed Data

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Advisor/Supervisor/Committee co-chair

Additional Committee Member 2

Additional Committee Member 3

Usage metrics

Categories

Keywords

Licence

Exports