Purdue University Graduate School
Complex Document Parsing with Vision Language Models

thesis
posted on 2024-12-17, 20:38 authored by Yifei Hu

This thesis explores the application of vision language models (VLMs) to document layout analysis (DLA) and optical character recognition (OCR). For document layout analysis, we found that VLMs excel at detecting text areas by leveraging their understanding of textual content rather than relying solely on visual features. This approach proves more robust than traditional object detection methods, particularly for the text-rich images typical of document analysis tasks. In addressing OCR challenges, we identified a critical bottleneck: the lack of high-quality, document-level OCR datasets. To overcome this limitation, we developed a novel synthetic data generation pipeline that uses Large Language Models to create OCR training data by rendering markdown source text into images. Our experiments show that VLMs trained on this synthetic data outperform models trained on conventional datasets. This research highlights the potential of VLMs in document understanding tasks and introduces an innovative approach to generating training data for OCR. Our findings suggest that leveraging the dual image-text understanding capabilities of VLMs, combined with strategically generated synthetic data, can significantly advance the state of the art in document layout analysis and OCR.
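The abstract does not spell out how the synthetic data pipeline is implemented, so the following is only a minimal sketch of the general idea it describes: pair a markdown source (the OCR target) with a rendered form of the same text (the model input). All names here (`OcrSample`, `markdown_to_html`, `make_sample`) are hypothetical, the markdown support covers only headings and paragraphs, and the final rasterization of HTML into an image is assumed to be handled by an external tool that is not shown.

```python
import html
from dataclasses import dataclass


@dataclass
class OcrSample:
    """One synthetic training pair (hypothetical structure, not from the thesis)."""
    page_html: str    # what an external rasterizer would turn into the input image
    target_text: str  # the markdown the model should learn to output


def markdown_to_html(markdown_text: str) -> str:
    """Convert a tiny markdown subset (ATX headings, paragraphs) to HTML."""
    blocks = []
    for block in markdown_text.strip().split("\n\n"):
        # Join soft-wrapped lines of one block into a single line.
        text = " ".join(line.strip() for line in block.splitlines())
        hashes = len(text) - len(text.lstrip("#"))
        if 1 <= hashes <= 6 and text[hashes:hashes + 1] == " ":
            body = html.escape(text[hashes:].strip())
            blocks.append(f"<h{hashes}>{body}</h{hashes}>")
        else:
            blocks.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(blocks)


def make_sample(markdown_text: str) -> OcrSample:
    """Build a (rendered page, ground-truth markdown) pair from one source text."""
    return OcrSample(
        page_html=markdown_to_html(markdown_text),
        target_text=markdown_text.strip(),
    )


if __name__ == "__main__":
    # In the pipeline the abstract describes, this text would come from an LLM.
    sample = make_sample("# Title\n\nHello & welcome.\n")
    print(sample.page_html)
```

Because the label is the source text itself, the ground truth is exact by construction, which is the property that makes rendered synthetic data attractive when document-level OCR datasets are scarce.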

Degree Type

  • Doctor of Philosophy

Department

  • Computer and Information Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Julia Rayz

Additional Committee Member 2

Baijian Yang

Additional Committee Member 3

Dominic Kao

Additional Committee Member 4

Tianyi Li

Additional Committee Member 5

Yan Cong

Additional Committee Member 6

Tatiana Ringenberg
