Source code search for automatic bug localization

Akbar, Shayan Ali A

doi:10.25394/PGS.13366679.v1

thesis

posted on 2020-12-14, 18:03 authored by Shayan Ali A Akbar

This dissertation advances the state of the art in information retrieval (IR) based automatic bug localization for large software systems. We present techniques from three generations of IR-based bug localization and compare their performance on our large and diverse bug localization dataset, the Bugzbook dataset. The three generations span more than fifteen years of research in mining software repositories for bug localization: (1) the generation of simple bag-of-words (BoW) based techniques; (2) the generation in which software-centric information, such as bug and code change histories and the structured information embedded in bug reports and code files, is exploited to improve retrieval; and (3) the third and most recent generation, in which order and semantic relationships between terms are modeled to improve the performance of bug localization systems. The dissertation also presents a novel technique called SCOR (Source Code Retrieval with Semantics and Order), which combines Markov Random Field (MRF) based term-term ordering dependencies with semantic word vectors obtained from neural network based word embedding algorithms, such as word2vec, to better localize bugs in code files. The results presented in this dissertation show that, while term-term ordering and semantic relationships each significantly improve performance when modeled separately in a retrieval system, the highest retrieval precision is obtained when they are modeled together in a single retrieval system. We also show that the semantic representations of software terms, learned by training the word embedding algorithm on a corpus of software repositories, can be used to search new software code repositories that are not present in the training corpus of the word embedding algorithm.
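As a rough illustration of the first-generation bag-of-words idea mentioned in the abstract, the sketch below ranks code files against a bug report by TF-IDF cosine similarity. It is not SCOR and not the dissertation's implementation: the function names `tokenize` and `rank_files`, the smoothed IDF weighting, and the toy file contents in the usage note are all assumptions made for illustration. It ignores term order and semantics, which are the gaps the later generations described above aim to close.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    tokens, word = [], []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def rank_files(bug_report, files):
    """Rank files (name -> source text) by TF-IDF cosine similarity
    to the bug report. Hypothetical helper, for illustration only."""
    docs = {name: Counter(tokenize(text)) for name, text in files.items()}
    n_docs = len(docs)

    # Document frequency: in how many files each term appears.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    def idf(term):
        # Smoothed IDF (an assumed choice) so unseen terms do not divide by zero.
        return math.log((n_docs + 1) / (df[term] + 1)) + 1.0

    def weigh(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(bug_report)))
    scores = {name: cosine(query, weigh(c)) for name, c in docs.items()}
    # Highest-scoring files first: the candidate locations of the bug.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, calling `rank_files("NullPointerException when parser reads token", {...})` with a file that mentions `Parser` and `NullPointerException` and another that shares no terms with the report places the first file at the top of the ranking, since only shared terms contribute to the score.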

Degree Type

Doctor of Philosophy

Department

Electrical and Computer Engineering

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Avinash Kak

Additional Committee Member 2

Charles Bouman

Additional Committee Member 3

Arif Ghafoor

Additional Committee Member 4

Samuel Midkiff

Keywords

code search, text embeddings, information retrieval, bug localization, Word2vec, Computer Engineering

Licence

CC BY 4.0
