Privacy-Enhancing Techniques for Data Analytics
Organizations today collect and aggregate huge amounts of data from individuals under various scenarios and for different purposes. Such aggregation of individuals’ data when combined with techniques of data analytics allows organizations to make informed decisions and predictions. But in many situations, different portions of the data associated with individuals are collected and curated by different organizations. To derive more accurate conclusions and predictions, those organization may want to conduct the analysis based on their joint data, which cannot be simply accomplished by each organization exchanging its own data with other organizations due to the sensitive nature of data. Developing approaches for collaborative privacy-preserving data analytics, however, is a nontrivial task. At least two major challenges have to be addressed. The first challenge is that the security of the data possessed by each organization should always be properly protected during and after the collaborative analysis process, whereas the second challenge is the high computational complexity usually accompanied by cryptographic primitives used to build such privacy-preserving protocols.
In this dissertation, based on widely adopted primitives in cryptography, we address the aforementioned challenges by developing techniques for data analytics that
not only allow multiple mutually distrustful parties to perform data analysis on their
joint data in a privacy-preserving manner, but also reduce the time required to complete the analysis. More specifically, using three common data analytics tasks as
concrete examples, we show how to construct the respective privacy-preserving protocols under two different scenarios: (1) the protocols are executed by a collaborative process only involving the participating parties; (2) the protocols are outsourced to
some service providers in the cloud. Two types of optimization for improving the
efficiency of those protocols are also investigated. The first type allows each participating party access to a statistically controlled leakage so as to reduce the amount
of required computation, while the second type utilizes the parallelism that could
be incorporated into the task and pushes some computation to the offline phase to
reduce the time needed for each participating party without any additional leakage.
Extensive experiments are also conducted on real-world datasets to demonstrate the
effectiveness of our proposed techniques.