Biomedical Concept Association and Clustering Using Word Embeddings

Shah, Setu

doi:10.25394/PGS.7396886.v1

Thesis v4.4.pdf (7.8 MB)

thesis

posted on 2019-02-12, 18:09, authored by Setu Shah

Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, and similar sources. While text mining and natural language processing tools have been widely employed across various domains, their adoption in healthcare is only beginning.

A primary hurdle in building artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed, labelled training data becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further degrades the quality of the results and limits their applicability to a restricted dataset. This in turn reduces the reproducibility of results and techniques across datasets, making it difficult to deploy research solutions to improve healthcare services.

The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that concepts that are similar to each other are clustered together. This clustering captures not only different synonyms of the same concept, but also the similarities between various diseases and the symptoms associated with them.
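To make the idea of "similar concepts lie close together" concrete, the following is a minimal sketch of comparing concept vectors by cosine similarity. It is an illustration only, not the thesis's implementation: the concept names, the 3-dimensional vectors, and the helper names `cosine_similarity` and `most_similar` are all assumptions invented for this example, whereas real embeddings would be learned from a biomedical corpus and have far more dimensions.

```python
import math

# Hypothetical toy embeddings; the values are invented for illustration
# and are NOT taken from the thesis.
CONCEPT_VECTORS = {
    "myocardial infarction": [0.90, 0.10, 0.05],
    "heart attack": [0.88, 0.12, 0.07],
    "chest pain": [0.70, 0.30, 0.10],
    "influenza": [0.05, 0.90, 0.20],
    "fever": [0.10, 0.80, 0.30],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(concept, vectors, top_n=2):
    """Return the top_n other concepts ranked by cosine similarity."""
    query = vectors[concept]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != concept
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


neighbours = most_similar("myocardial infarction", CONCEPT_VECTORS)
print([name for name, _ in neighbours])
```

With these toy values, the synonym "heart attack" ranks first and the related symptom "chest pain" second, which mirrors the kind of synonym and disease-symptom proximity the abstract describes.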

To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. This document clustering has broad applications, especially in search and retrieval, where it can give clinicians, researchers and patients more holistic and comprehensive results than a search restricted to the exact term they entered.
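The abstract does not specify how document vectors are built from concept vectors. As a hedged sketch, one common baseline is to average the vectors of the concepts found in a document and then cluster the results with k-means. Everything below is an assumption made for illustration: the 2-dimensional vectors, the concept lists, the averaging rule, and the helper names `document_vector` and `kmeans`. It should not be read as the method proposed in the thesis.

```python
import math

# Hypothetical concept vectors; values are invented for illustration.
CONCEPT_VECTORS = {
    "diabetes": [1.0, 0.1],
    "insulin": [0.9, 0.2],
    "hyperglycemia": [0.95, 0.0],
    "asthma": [0.1, 1.0],
    "wheezing": [0.2, 0.9],
    "inhaler": [0.0, 0.95],
}


def document_vector(concepts, vectors):
    """Average the vectors of the known concepts in a document (assumed rule)."""
    known = [vectors[c] for c in concepts if c in vectors]
    if not known:
        raise ValueError("document contains no known concepts")
    return [sum(column) / len(known) for column in zip(*known)]


def kmeans(points, k, iterations=10):
    """Plain k-means with Euclidean distance, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Each document is represented only by the concepts extracted from it.
documents = [
    ["diabetes", "insulin"],
    ["asthma", "inhaler"],
    ["hyperglycemia", "insulin"],
    ["wheezing", "asthma"],
]
doc_vectors = [document_vector(d, CONCEPT_VECTORS) for d in documents]
labels = kmeans(doc_vectors, k=2)
print(labels)
```

In this toy setup the two diabetes-related documents share one cluster and the two asthma-related documents share the other, even though no pair of documents uses exactly the same set of terms.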

Finally, a framework is presented for extracting, from preventive care guidelines, clinical information that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool for observing and better understanding patient trajectories is also explored. Both of these methods have the potential to improve the preventive care services provided to patients.

Degree Type

Master of Science in Electrical and Computer Engineering

Department

Electrical and Computer Engineering

Campus location

Indianapolis

Advisor/Supervisor/Committee Chair

Dr. Xiao Luo

Advisor/Supervisor/Committee co-chair

Dr. Mohamed El-Sharkawy

Additional Committee Member 2

Dr. Brian King

Keywords

biomedical science, Natural language processing, Word embeddings, Artificial intelligence, document clustering, preventive care, Computer Engineering, Natural Language Processing, Artificial Intelligence and Image Processing

Licence

CC BY 4.0
