Efficient Matrix-aware Relational Query Processing in Big Data Systems

Yu, Yongyang

doi:10.25394/PGS.7543103.v1

Yongyang Yu.pdf (1.33 MB)

Efficient Matrix-aware Relational Query Processing in Big Data Systems

thesis

posted on 2019-01-03, 13:54 authored by Yongyang YuYongyang Yu

In the big data era, the use of large-scale machine learning methods is becoming ubiquitous in data exploration tasks ranging from business intelligence and bioinformatics to self-driving cars. In these domains, a number of queries are composed of various kinds of operators, such as relational operators for preprocessing input data, and machine learning models for complex analysis. Usually, these learning methods heavily rely on matrix computations. As a result, it is imperative to develop novel query processing approaches and systems that are aware of big matrix data and corresponding operators, scale to clusters of hundreds of machines, and leverage distributed memory for high-performance computation. This dissertation introduces and studies several matrix-aware relational query processing strategies, analyzes and optimizes their performance.

The first contribution of this dissertation is MatFast, a matrix computation system for efficiently processing and optimizing matrix-only queries in a distributed in-memory environment. We introduce a set of heuristic rules to rewrite special features of a matrix query for less memory footprint, and cost models to estimate the sparsity of sparse matrix multiplications, and to distribute the matrix data partitions among various compute workers for a communication-efficient execution. We implement and test the query processing strategies in an open-source distributed dataflow

engine (Apache Spark).

In the second contribution of this dissertation, we extend MatFast to MatRel, where we study how to efficiently process queries that involve both matrix and relational operators. We identify a series of equivalent transformation rules to rewrite a logical plan when both relational and matrix operations are present. We introduce selection, projection, aggregation, and join operators over matrix data, and propose optimizations to reduce computation overhead. We also design a cost model to distribute matrix data among various compute workers for communication-efficient

evaluation of relational join operations.

In the third and last contribution of this dissertation, we demonstrate how to leverage MatRel for optimizing complex matrix-aware relational query evaluation pipelines. Especially, we showcase how to efficiently learn model parameters for deep neural networks of various applications with MatRel, e.g., Word2Vec.

History

Degree Type

Doctor of Philosophy

Department

Computer Science

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Walid G. Aref

Additional Committee Member 2

Elisa Bertino

Additional Committee Member 3

Sunil Prabhakar

Additional Committee Member 4

Alex Pothen

Usage metrics

Keywords

big data query optimization matrix computation distributed computing Applied Computer Science

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Efficient Matrix-aware Relational Query Processing in Big Data Systems

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Additional Committee Member 2

Additional Committee Member 3

Additional Committee Member 4

Usage metrics

Categories

Keywords

Licence

Exports