Text Normalization based on Error Type using Pre-trained Language Model
With the emergence of social media and its growing popularity, there has been substantial growth in User Generated Content (UGC), which holds great potential as a source of meaningful information. Because social media content is dynamic and many Natural Language Processing (NLP) systems were originally developed for standard data, these systems suffer from performance degradation when applied to it. To mitigate this significant drop in performance, normalization was introduced as a pre-processing step that is applied to non-standard texts before they are passed to downstream tasks. This thesis investigates the incorporation of the pre-trained language model BERT into normalization and how the performance of normalization methods varies across different types of errors. In this study, BERT is used for candidate generation, and simple ranking methods are then applied to rank the candidates that BERT generates. Both the candidate generation performance of BERT and the ranking performance of the different methods are analyzed by error type.
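The two-stage design described above (candidate generation followed by candidate ranking) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the candidate generator is a fixed toy lookup standing in for BERT's predictions, and character-level similarity from Python's `difflib` is only one possible "simple ranking method", not necessarily one evaluated in this thesis.

```python
from difflib import SequenceMatcher


def generate_candidates(token, context):
    """Hypothetical stand-in for BERT-based candidate generation.

    In the real setting a masked language model would propose replacements
    for `token` given `context`; here a fixed toy table is used so the
    sketch runs without any model or external dependency.
    """
    toy_predictions = {
        "tmrw": ["today", "tonight", "tomorrow", "morning"],
    }
    # Fall back to the token itself when no candidates are available.
    return toy_predictions.get(token, [token])


def rank_candidates(token, candidates):
    """Rank candidates by character-level similarity to the noisy token.

    This is an assumed ranking method chosen for simplicity.
    """
    def similarity(candidate):
        return SequenceMatcher(None, token, candidate).ratio()

    return sorted(candidates, key=similarity, reverse=True)


def normalize(token, context):
    """Generate candidates, rank them, and return the top-ranked one."""
    candidates = generate_candidates(token, context)
    return rank_candidates(token, candidates)[0]


if __name__ == "__main__":
    print(normalize("tmrw", "see you tmrw"))
```

Keeping generation and ranking as separate functions mirrors the separation in the study, where the two stages are evaluated independently for each error type.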