Vast amounts of sensitive personal information are collected by companies, institutions and governments. A key technological challenge is how to effectively extract knowledge from data while preserving the privacy of the individuals involved. In this dissertation, we address this challenge from the perspective of privacy-preserving data collection and analysis. We focus on investigation of a technique called local differential privacy (LDP) and studied several aspects of it.
In particular, the thesis serves as a comprehensive study of multiple aspects of the LDP field. We investigated the following seven problems: (1) We studied LDP primitives, i.e., the basic mechanisms that are used to build LDP protocols. (2) We then studied the problem when the domain size is very big (e.g., larger than $2^{32$), where finding the values with high frequency is a challenge, because one needs to enumerate through all values. (3) Another interesting setting is when each user possesses a set of values, instead of a single private value. (4) With the basic problems visited, we then aim to make the LDP protocols practical for real-world scenarios. We investigated the case where each user's data is high-dimensional (e.g., in the census survey, each user has multiple questions to answer), and the goal is to recover the joint distribution among the attributes. (5) We also built a system for companies to issue SQL queries over the data protected under LDP, where each user is associated with some public weights and holds some private values; an LDP version of the values is sent to the server from each user. (6) To further increase the accuracy of LDP, we study how to add post-processing steps to protocols to make them consistent while achieving high accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. (7) Finally, we investigate a different model of LDP which is called the shuffler model. While users still use LDP algorithms to report their sensitive data, now there exists a semi-trusted shuffler that shuffles the users' reports and then send them to the server. This model provides better utility but at the cost of requiring more trust that the shuffler should not collude with the server.
Funding
EAGER: Bridging The Gap between Theory and Practice in Data Privacy
Directorate for Computer & Information Science & Engineering