Clustering High-dimensional Noisy Categorical and Mixed Data

Reason: Going to publish this work on journals.

1

year(s)

10

month(s)

8

day(s)

until file(s) become available

Clustering High-dimensional Noisy Categorical and Mixed Data

thesis
posted on 27.07.2021, 16:15 by Zhiyi TianZhiyi Tian
Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1-epsilon) >= CMr2 log3M, with M=max(n,m) and a fixed constant C. Moreover, we show that mn(1- epsilon)2 >= r *delta/2 with 0< delta <1 is necessary for any algorithm to succeed with probability at least (1+delta)/2. In case, where m=n and r is fixed, for example, the sufficient condition matches with the necessary condition up to a polylog(n) factor, showing that our proposed algorithm is nearly optimal. We also show our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency, both theoretically and numerically. In addition, we propose a spectral algorithm with standardization to cluster mixed data. This algorithm is computationally efficient and its clustering accuracy has been evaluated numerically on both real world data and synthetic data.

History

Degree Type

Doctor of Philosophy

Department

Management

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Jen Tang

Advisor/Supervisor/Committee co-chair

Jiaming Xu

Additional Committee Member 2

Yanjun Li

Additional Committee Member 3

Wei Sun

Usage metrics

Categories

Licence

Exports