Purdue University Graduate School
Browse
thesis.pdf (2.71 MB)

Compilation Techniques, Algorithms, and Data Structures for Efficient and Expressive Data Processing Systems

Download (2.71 MB)
The proliferation of digital data, driven by factors like social media, e-commerce, etc., has created an increasing demand for highly processed data at higher levels of fidelity, which puts increasing demands on modern data processing systems. In the past, data processing systems faced bottlenecks due to limited main memory availability. However, as main memory becomes more abundant, their optimization focus has shifted from disk I/O to optimized computation through techniques like compilation. This dissertation addresses several critical limitations within such compilation-based data processing systems.

In modern data analytics pipelines, combination of workloads from various paradigms, such as traditional DBMS and Machine Learning, is common.
These pipelines are typically managed by specialized systems designed for specific workload types. While these specialized systems optimize their individual performance, substantial performance loss occurs when they are combined to handle mixed workloads. This loss is mainly due to overheads at system boundaries, including data copying and format conversions, as well as the general inability to perform cross-system optimizations.

This dissertation tackles this problem in two angles. First, it proposes an efficient post-hoc integration of individual systems using generative programming via the construction of common intermediate layers. This approach preserves the best-of-breed performance of individual workloads while achieving state-of-the-art performance for combined workloads. Second, we introduce a high-level query language capable of expressing various workload types, acting as a general substrate to implement combined workloads. This allows the generation of optimized code for end-to-end workloads through
the construction of an intermediate representation (IR).

The dissertation then shifts focus to data processing systems used for incremental view maintenance (IVM). While existing IVM systems achieve high performance through compilation and novel algorithms, they have limitations in handling specific query classes. Notably, they are incapable of handling queries involving correlated nested aggregate subqueries. To address this, our work proposes a novel indexing scheme based on a new data structure and a corresponding set of algorithms that fully incrementalize such queries. This approach result in substantial asymptotic speedups and order-of-magnitude performance improvements for workloads of practical importance.

Finally, the dissertation explores efficient and expressive fixed-point computations, with a focus on Datalog--a language widely used for declarative program analysis. Although existing Datalog engines rely on compilation and specialized code generation to achieve performance, they lack the flexibility to support extensions required for complex program analysis. Our work introduces a new Datalog engine built using generative programming techniques that offers both flexibility and state-of-the-art performance through specialized code generation.

Funding

CAREER: Generative Programming and DSLs for Safe Performance Critical Systems

Directorate for Computer & Information Science & Engineering

Find out more...

SHF: Medium: Collaborative Research: From Volume to Velocity: Big Data Analytics in Near-Realtime

Directorate for Computer & Information Science & Engineering

Find out more...

FMitF: Track I: Symbolic Reasoning with Graph Networks

Directorate for Computer & Information Science & Engineering

Find out more...

III: Small: Native Compilation, Query Processing, and Indexing for In-memory Graph Relational Data Systems

Directorate for Computer & Information Science & Engineering

Find out more...

Program Generators for Exascale and Beyond

Office of Advanced Scientific Computing Research

Find out more...

History

Degree Type

  • Doctor of Philosophy

Department

  • Computer Science

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Dr. Tiark Rompf

Additional Committee Member 2

Dr. Walid Aref

Additional Committee Member 3

Dr. Milind Kulkarni

Additional Committee Member 4

Dr. Yongle Zhang