Purdue University Graduate School
PurdueThesis_Zhiyi Tian.pdf (1.03 MB)

Clustering High-dimensional Noisy Categorical and Mixed Data

Download (1.03 MB)
posted on 2021-07-27, 16:15 authored by Zhiyi TianZhiyi Tian
Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1-epsilon) >= CMr2 log3M, with M=max(n,m) and a fixed constant C. Moreover, we show that mn(1- epsilon)2 >= r *delta/2 with 0< delta <1 is necessary for any algorithm to succeed with probability at least (1+delta)/2. In case, where m=n and r is fixed, for example, the sufficient condition matches with the necessary condition up to a polylog(n) factor, showing that our proposed algorithm is nearly optimal. We also show our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency, both theoretically and numerically. In addition, we propose a spectral algorithm with standardization to cluster mixed data. This algorithm is computationally efficient and its clustering accuracy has been evaluated numerically on both real world data and synthetic data.


Degree Type

  • Doctor of Philosophy


  • Management

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Jen Tang

Advisor/Supervisor/Committee co-chair

Jiaming Xu

Additional Committee Member 2

Yanjun Li

Additional Committee Member 3

Wei Sun

Usage metrics



    Ref. manager