A Systems Approach to Rule-Based Data Cleaning

Posted on 2019-05-10, 16:10, authored by Amr H Ebaid
High-quality data is a vital asset for many businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of heterogeneous data-quality rules.

An end-to-end data-cleaning system must address several important challenges: (1) It should handle heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It should be extensible with various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It must support continuous data cleaning and adapt to inevitable data changes. (4) It has to provide user-friendly, interpretable explanations for the detected errors and the chosen repairs.

This dissertation presents a systems approach to rule-based data cleaning that is generalized, extensible, continuous, and explanatory. The proposed system separates a programming interface from a core to address the above challenges. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data and how to fix it. Treating all rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.
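
The split between a rule interface and a rule-agnostic core can be illustrated with a small Python sketch. It is not the dissertation's implementation: every name here (`Rule`, `Violation`, `Fix`, `TitleCase`, `FunctionalDependency`, `clean`) is hypothetical, and the majority-vote repair is a simple stand-in for whatever repair algorithms the actual system plugs in. The sketch only shows the general shape: each rule says what is wrong and how to fix it, and the core calls `detect` and `repair` without knowing what kind of rule it holds, while keeping an audit trail.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]


@dataclass
class Violation:
    rule: str         # name of the rule that fired
    rows: List[int]   # indices of the rows involved
    column: str       # column suspected to be wrong
    reason: str       # human-readable explanation of the error


@dataclass
class Fix:
    row: int
    column: str
    value: str
    reason: str       # human-readable explanation of the repair


class Rule:
    """Hypothetical programming interface: what is wrong, and how to fix it."""
    name = "rule"

    def detect(self, table: List[Row]) -> List[Violation]:
        raise NotImplementedError

    def repair(self, table: List[Row], v: Violation) -> List[Fix]:
        raise NotImplementedError


class TitleCase(Rule):
    """Single-row rule: values in `column` must be title-cased."""

    def __init__(self, column: str):
        self.column = column
        self.name = f"title-case({column})"

    def detect(self, table):
        return [Violation(self.name, [i], self.column,
                          f"{row[self.column]!r} is not title-cased")
                for i, row in enumerate(table)
                if row[self.column] != row[self.column].title()]

    def repair(self, table, v):
        return [Fix(i, v.column, table[i][v.column].title(), "normalize casing")
                for i in v.rows]


class FunctionalDependency(Rule):
    """Multi-row rule: rows that agree on `lhs` must agree on `rhs`."""

    def __init__(self, lhs: str, rhs: str):
        self.lhs, self.rhs = lhs, rhs
        self.name = f"fd({lhs} -> {rhs})"

    def detect(self, table):
        groups = defaultdict(list)
        for i, row in enumerate(table):
            groups[row[self.lhs]].append(i)
        return [Violation(self.name, idxs, self.rhs,
                          f"{self.lhs}={key!r} maps to several {self.rhs} values")
                for key, idxs in groups.items()
                if len({table[i][self.rhs] for i in idxs}) > 1]

    def repair(self, table, v):
        # Assumption: majority vote, a stand-in for a real repair algorithm.
        majority = Counter(table[i][v.column] for i in v.rows).most_common(1)[0][0]
        return [Fix(i, v.column, majority, "majority value in the group")
                for i in v.rows if table[i][v.column] != majority]


def clean(table: List[Row], rules: List[Rule], max_rounds: int = 10) -> List[str]:
    """Core: treats rules as black boxes, interleaves repairs, records an audit trail."""
    audit = []
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            for v in rule.detect(table):
                for fix in rule.repair(table, v):
                    old = table[fix.row][fix.column]
                    if old != fix.value:
                        table[fix.row][fix.column] = fix.value
                        audit.append(f"{v.rule}: row {fix.row} {fix.column} "
                                     f"{old!r} -> {fix.value!r} ({v.reason}; {fix.reason})")
                        changed = True
        if not changed:   # fixpoint: no rule changed the data in this round
            break
    return audit


table = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "west lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
]
audit = clean(table, [TitleCase("city"), FunctionalDependency("zip", "city")])
for line in audit:
    print(line)
```

Rerunning `clean` after the table changes is a crude way to approximate continuous cleaning in this sketch; an actual system would presumably detect and repair incrementally rather than rescan the whole table.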


Degree Type

Doctor of Philosophy


Computer Science

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Walid G. Aref

Advisor/Supervisor/Committee co-chair

Ahmed K. Elmagarmid

Additional Committee Member 2

Mourad Ouzzani

Additional Committee Member 3

Sunil Prabhakar

Additional Committee Member 4

Christopher W. Clifton

Additional Committee Member 5

Jennifer Neville