Purdue University Graduate School

Text Normalization based on Error Type using Pre-trained Language Model

posted on 2021-04-29, 18:39 authored by You Lim Ko

With the emergence and growing popularity of social media, User Generated Content (UGC) has grown substantially and holds great potential as a source of meaningful information. Because social media content is dynamic and many Natural Language Processing (NLP) systems were originally developed for standard data, these systems suffer performance degradation when applied to it. To address this drop in performance, normalization of non-standard text was introduced as a pre-processing step applied before downstream tasks. This thesis investigates the incorporation of the pre-trained language model BERT into normalization, and how the performance of normalization methods varies across different types of errors. In this study, BERT is used for candidate generation, and simple ranking methods are then applied to rank the normalization candidates that BERT generates. Both the candidate generation performance of BERT and the ranking performance of the different methods are evaluated by error type.
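The two-stage pipeline described in the abstract (generate candidates, then rank them) can be illustrated with a minimal sketch. The abstract does not name the ranking methods used, so the character-level similarity below is an assumption chosen only for illustration, and `toy_generator` is a hypothetical stand-in for the top predictions of a masked language model such as BERT; neither is taken from the thesis.

```python
from difflib import SequenceMatcher


def rank_candidates(noisy_token, candidates):
    """Order candidates by character-level similarity to the noisy token.

    Assumption: string similarity is only one possible "simple ranking
    method"; the thesis abstract does not say which methods were used.
    """
    def similarity(candidate):
        return SequenceMatcher(None, noisy_token.lower(), candidate.lower()).ratio()

    # sorted() is stable, so ties keep the generator's original order.
    return sorted(candidates, key=similarity, reverse=True)


def normalize(noisy_token, generate_candidates):
    """Return the top-ranked candidate, or the token itself if none exist."""
    candidates = generate_candidates(noisy_token)
    if not candidates:
        return noisy_token
    return rank_candidates(noisy_token, candidates)[0]


def toy_generator(noisy_token):
    # Hypothetical stand-in for a masked language model's top-k predictions
    # at the position of the noisy token; the list is hard-coded.
    return ["today", "tomorrow", "tonight", "together"]


if __name__ == "__main__":
    print(normalize("tmrw", toy_generator))
```

Keeping the generator behind a plain function argument means a real masked-language-model call could be swapped in without touching the ranking step, which mirrors the separation between candidate generation and candidate ranking in the abstract.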


Degree Type

  • Master of Science

Department

  • Computer and Information Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Eric T. Matson

Advisor/Supervisor/Committee co-chair

Baijian Yang

Additional Committee Member 2

Julia M. Rayz
