Purdue University Graduate School

Discourse Relation Dataset on Hands-on Engineering Tutorial Monologue Video Scripts Based on PDTB3 Scheme

dataset
posted on 2025-05-08, 16:24 authored by Rajasekhar Kakarla

My research focuses on developing a specialized discourse relation dataset from hands-on engineering tutorial monologue video transcripts, based on the Penn Discourse Treebank 3.0 (PDTB-3) annotation framework. The core objective of this work is to systematically analyze and annotate how knowledge is structured and conveyed in technical instructional videos, which are a growing resource for self-guided learning in engineering domains.

The study involves collecting transcripts from real-world engineering tutorials, such as those on plumbing, welding, electrical wiring, and HVAC systems, and annotating them with discourse relations that capture the logical, temporal, and causal connections between instructional steps. Using the PDTB-3 taxonomy, I categorized these relations into senses such as Contingency.Cause, Temporal.Synchronous, Expansion.Instantiation, and Comparison.Contrast, among others.

What sets this research apart is its domain specificity and its incorporation of challenging discourse phenomena, including implicit connectives, non-adjacent argument spans, alternative lexicalizations (AltLex), and compound relations. These features reflect the nuanced ways expert instructors communicate procedural knowledge. To support this, I implemented a two-stage annotation process: first, AI-assisted segmentation, followed by manual refinement and inter-annotator agreement evaluation.
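As an illustration only, inter-annotator agreement on sense labels could be measured along the following lines. This record does not state which agreement statistic was used; Cohen's kappa for two annotators is an assumption here, and the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one sense label per relation per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four relations (not taken from the dataset).
annotator_1 = ["Contingency.Cause", "Temporal.Synchronous",
               "Expansion.Instantiation", "Comparison.Contrast"]
annotator_2 = ["Contingency.Cause", "Temporal.Synchronous",
               "Expansion.Instantiation", "Contingency.Cause"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few senses dominate the label distribution.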

Additionally, the dataset was used to benchmark the discourse annotation capabilities of several Large Language Models (LLMs), such as ChatGPT-4, Claude 3 Sonnet, Gemini 2.0, and Grok 3.0. Metrics like accuracy and F1-score were used to compare their performance against human annotations.
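A minimal sketch of how model-predicted senses could be scored against human annotations is shown below. The record names accuracy and F1-score but not the averaging scheme; macro-averaged F1 is an assumption here, and the function names and sample labels are hypothetical.

```python
def accuracy(gold, pred):
    """Share of relations where the predicted sense equals the gold sense."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-sense F1 (assumed averaging scheme)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and model labels (not taken from the dataset).
gold = ["Contingency.Cause", "Contingency.Cause",
        "Temporal.Synchronous", "Comparison.Contrast"]
pred = ["Contingency.Cause", "Temporal.Synchronous",
        "Temporal.Synchronous", "Comparison.Contrast"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(acc, round(f1, 3))
```

Macro averaging gives rare senses the same weight as frequent ones, so it can diverge noticeably from accuracy on skewed sense distributions.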

Overall, the research aims to contribute a high-quality resource that can be used to improve automated discourse understanding in technical education contexts. It bridges gaps in existing discourse datasets by targeting domain-specific, spoken-form, procedural knowledge, which makes it highly relevant for applications in AI-based tutoring systems, expert knowledge transfer, and educational content analysis.

Degree Type

  • Master of Science in Industrial Technology

Department

  • Computer and Information Technology

Campus location

  • Hammond

Advisor/Supervisor/Committee Chair

Cheng Zhang

Additional Committee Member 2

Ashok Vardhan Raja

Additional Committee Member 3

Afshin Zahraee
