File(s) under embargo
until file(s) become available
DISTRIBUTED NEAREST NEIGHBOR CLASSIFICATION WITH APPLICATIONS TO CROWDSOURCING
thesisposted on 26.07.2021, 04:09 authored by Jiexin DuanJiexin Duan
The aim of this dissertation is to study two problems of distributed nearest neighbor classification (DiNN) systematically. The first one compares two DiNN classifiers based on different schemes: majority voting and weighted voting. The second one is an extension of the DiNN method to the crowdsourcing application, which allows each worker data has a different size and noisy labels due to low worker quality. Both statistical guarantees and numerical comparisons are studied in depth.
The first part of the dissertation focuses on the distributed nearest neighbor classification in big data. The sheer volume and spatial/temporal disparity of big data may prohibit centrally processing and storing the data. This has imposed a considerable hurdle for nearest neighbor predictions since the entire training data must be memorized. One effective way to overcome this issue is the distributed learning framework. Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of the regret, up to a multiplicative constant that depends solely on the data dimension. The multiplicative difference can be eliminated by replacing majority voting with the weighted voting scheme. In addition, we provide sharp theoretical upper bounds of the number of subsamples in order for the distributed nearest neighbor classifier to reach the optimal convergence rate. It is interesting to note that the weighted voting scheme allows a larger number of subsamples than the majority voting one.
The second part of the dissertation extends the DiNN methods to the application in crowdsourcing. The noisy labels in crowdsourcing data and different sizes of worker data will deteriorate the performance of DiNN methods. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue. Our proposed method achieves the same regret as its oracle version on the expert data with the same size. We also propose two algorithms to estimate the worker quality if it is unknown in practice. One method constructs the estimators for worker quality based on the denoised worker labels through applying kNN classifier on expert data. Unlike previous worker quality estimation methods, which have no statistical guarantee, it achieves the same regret as the ENN with observed worker quality. The other method estimates the worker quality iteratively based on ENN, and it works well without expert data required by most previous methods.