NATURAL LANGUAGE PROCESSING-BASED AUTOMATED INFORMATION EXTRACTION FROM BUILDING CODES TO SUPPORT AUTOMATED COMPLIANCE CHECKING
Traditional manual code compliance checking is time-consuming, costly, and error-prone (Zhang & El-Gohary, 2015). Automated code compliance checking systems have therefore emerged as an alternative. However, computer software cannot directly process the regulatory information in unstructured building code text; to support automated code compliance checking, building codes need to be transformed into a computer-processable, structured format. A prominent limitation is that most automated code compliance checking systems can check only a limited number of building code requirements.
The transformation of building code requirements into a computer-processable, structured format is a natural language processing (NLP) task that requires part-of-speech (POS) tagging on building codes that is more accurate than the state of the art provides. To address this need, this dissertation research provides a method that improves the performance of POS taggers through error-driven transformational rules that revise machine-tagged POS results. The proposed rules fix errors in POS tagging results in two steps. First, a rule locates an error in the POS tagging results by its context. Second, the rule replaces the erroneous POS tag with the correct POS tag that is stored in the rule. A dataset of POS-tagged building codes, the Part-of-Speech Tagged Building Codes (PTBC) dataset (Xue & Zhang, 2019), was published in the Purdue University Research Repository (PURR). Testing on the dataset showed that the method corrected 71.00% of the errors in POS tagging results for building codes. As a result, the POS tagging accuracy on building codes increased from 89.13% to 96.85%.
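The two-step behavior of an error-driven transformational rule can be sketched as follows. This is a minimal illustration in the general style of transformation-based tagging, not the rule format used in the dissertation: the `TransformationRule` class, its fields, the `apply_rules` helper, and the example sentence and rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


@dataclass(frozen=True)
class TransformationRule:
    """Hypothetical rule: a context that locates an error, plus the correct tag."""
    wrong_tag: str                  # tag the machine tagger assigned
    correct_tag: str                # tag stored in the rule as the fix
    prev_tag: Optional[str] = None  # required tag of the previous token, if any
    next_tag: Optional[str] = None  # required tag of the next token, if any

    def locates_error(self, tokens: List[TaggedToken], i: int) -> bool:
        # Step 1: locate an error by its context.
        if tokens[i][1] != self.wrong_tag:
            return False
        if self.prev_tag is not None:
            if i == 0 or tokens[i - 1][1] != self.prev_tag:
                return False
        if self.next_tag is not None:
            if i == len(tokens) - 1 or tokens[i + 1][1] != self.next_tag:
                return False
        return True


def apply_rules(tokens: List[TaggedToken],
                rules: List[TransformationRule]) -> List[TaggedToken]:
    """Apply each rule in order over a copy of the machine-tagged sentence."""
    revised = list(tokens)
    for rule in rules:
        for i in range(len(revised)):
            if rule.locates_error(revised, i):
                # Step 2: replace the erroneous tag with the stored correct tag.
                revised[i] = (revised[i][0], rule.correct_tag)
    return revised


# Assumed example: "exit" mis-tagged as a verb (VB) after a determiner (DT).
machine_tagged = [("The", "DT"), ("exit", "VB"), ("shall", "MD"),
                  ("be", "VB"), ("accessible", "JJ")]
rules = [TransformationRule(wrong_tag="VB", correct_tag="NN", prev_tag="DT")]
print(apply_rules(machine_tagged, rules))
```

In this sketch only "exit" is retagged, because "be" follows a modal rather than a determiner. The reported figures are also mutually consistent: a tagger at 89.13% accuracy leaves 10.87% of tokens in error, and correcting 71.00% of those leaves about 3.15%, which matches the reported 96.85% accuracy.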
This dissertation research also provides a new POS tagger that is tailored to building codes. The proposed POS tagger combines a neural network model with error-driven transformational rules. The neural network model consists of a pre-trained model and one or more trainable neural layers, and it was trained and fine-tuned on the PTBC dataset (Xue & Zhang, 2019). The best-performing configuration identified in this research used one bidirectional Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) trainable layer, a BERT-Cased-Base pre-trained model, and 50 epochs of training. This model achieved 91.89% precision without error-driven transformational rules and 95.11% precision with them, outperforming the 89.82% precision of the most advanced existing POS tagger on building codes.
Other automated information extraction methods were also developed in this dissertation. Some automated code compliance checking systems represent building codes as logic clauses and use pattern matching-based rules to convert building codes from natural language text to logic clauses (Zhang & El-Gohary, 2017). This dissertation research developed a ruleset expansion method that widens the range of building codes such systems can check by expanding their pattern matching-based ruleset. The ruleset expansion method guarantees (1) backward compatibility with the building codes that the ruleset was already able to process, and (2) forward compatibility with building codes that the ruleset may need to process in the future. The method was validated on Chapters 5 and 10 of the International Building Code 2015 (IBC 2015): Chapter 10 was used as the training dataset and Chapter 5 as the testing dataset. A gold standard of logic clauses was published in the Logic Clause Representation of Building Codes (LCRBC) dataset (Xue & Zhang, 2021). The expanded pattern matching-based rules were published in the dissertation (Appendix A). Compared with the baseline ruleset, the expanded ruleset increased the precision, recall, and F1-score of logic clause generation at the predicate level by 10.44%, 25.72%, and 18.02%, to 95.17%, 96.60%, and 95.88%, respectively.
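One way to picture ruleset expansion with a backward-compatibility check is the sketch below. It is an assumption-laden illustration, not the dissertation's ruleset or its guarantee mechanism: the regular expressions, the predicate names, the first-match-wins policy, and the helpers `generate`, `expand`, and `is_backward_compatible` are all hypothetical, and the sentences are invented rather than quoted from IBC 2015.

```python
import re

# Hypothetical rules: (pattern over a requirement sentence, predicate template).
BASELINE_RULES = [
    (re.compile(r"^(?P<subj>[\w ]+) shall be not less than (?P<val>\d+) (?P<unit>\w+)$"),
     "greater_than_or_equal({subj}, {val}, {unit})"),
]
NEW_RULES = [
    (re.compile(r"^(?P<subj>[\w ]+) shall not exceed (?P<val>\d+) (?P<unit>\w+)$"),
     "less_than_or_equal({subj}, {val}, {unit})"),
]


def generate(sentence, rules):
    """Return the predicate from the first matching rule, or None."""
    for pattern, template in rules:
        match = pattern.match(sentence)
        if match:
            # Normalize captured text into identifier-like arguments.
            args = {k: v.strip().lower().replace(" ", "_")
                    for k, v in match.groupdict().items()}
            return template.format(**args)
    return None


def expand(baseline, new_rules):
    # Appending keeps every baseline rule ahead of the new ones, so under
    # first-match-wins a sentence the baseline handled yields the same output.
    return list(baseline) + list(new_rules)


def is_backward_compatible(baseline, expanded, sentences):
    """Check that the expanded ruleset reproduces every baseline output."""
    return all(generate(s, expanded) == generate(s, baseline)
               for s in sentences
               if generate(s, baseline) is not None)


sentences = [
    "The door width shall be not less than 32 inches",
    "The travel distance shall not exceed 200 feet",
]
expanded_rules = expand(BASELINE_RULES, NEW_RULES)
for s in sentences:
    print(generate(s, BASELINE_RULES), "|", generate(s, expanded_rules))
print(is_backward_compatible(BASELINE_RULES, expanded_rules, sentences))
```

Here the second sentence is unprocessable under the baseline rules and becomes processable after expansion, while the first sentence yields the same predicate before and after. The reported F1-score is consistent with the reported precision and recall: 2 × 0.9517 × 0.9660 / (0.9517 + 0.9660) ≈ 0.9588.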
Most existing automated code compliance checking research has focused on regulatory information stored as text in building codes. However, a comprehensive automated code compliance checking process should also be able to check regulatory information stored in other forms, such as tables. Therefore, this dissertation research provides a semi-automated information extraction and transformation method for processing tabular information in building codes. The proposed method semi-automatically detects the layout of a table and stores the information extracted from the table in a database. Automated code compliance checking systems can then query the database for the regulatory information in the corresponding table. The algorithm’s initial implementation accurately processed 91.67% of the tables in the testing dataset, which was composed of tables in Chapter 10 of IBC 2015. After iterative upgrades, the updated method correctly processed all tables in the testing dataset.
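The store-then-query idea can be sketched with an in-memory SQLite database. This sketch assumes the table layout has already been detected as one header row and one header column, and it does not reproduce the dissertation's layout detection or database schema: the `cells` table, the `store_table` and `lookup` helpers, the table identifier, and all cell values are hypothetical placeholders, not data from IBC 2015.

```python
import sqlite3


def store_table(conn, table_id, column_headers, rows):
    """Store a table as one database row per cell (long format)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells ("
        "table_id TEXT, row_header TEXT, column_header TEXT, value TEXT)"
    )
    for row_header, values in rows:
        for column_header, value in zip(column_headers, values):
            conn.execute(
                "INSERT INTO cells VALUES (?, ?, ?, ?)",
                (table_id, row_header, column_header, value),
            )
    conn.commit()


def lookup(conn, table_id, row_header, column_header):
    """Return the cell value for a row/column header pair, or None."""
    cursor = conn.execute(
        "SELECT value FROM cells "
        "WHERE table_id = ? AND row_header = ? AND column_header = ?",
        (table_id, row_header, column_header),
    )
    found = cursor.fetchone()
    return found[0] if found else None


# Placeholder table: headers and values are invented for illustration only.
conn = sqlite3.connect(":memory:")
store_table(
    conn,
    table_id="EXAMPLE_TABLE",
    column_headers=["without sprinkler system", "with sprinkler system"],
    rows=[
        ("occupancy A", ["100", "150"]),
        ("occupancy B", ["120", "180"]),
    ],
)
print(lookup(conn, "EXAMPLE_TABLE", "occupancy B", "with sprinkler system"))
```

Storing one record per cell keeps the query independent of how many rows or columns a particular table has, which is one plausible reason to prefer it over a schema per table.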
NSF under Grant No. 1827733
- Doctor of Philosophy
- Construction Management Technology
- West Lafayette