Grades serve as one of the primary indicators of student learning, directing subsequent actions for students, instructors, and administrators alike. Grade validity, that is, the extent to which grades communicate a meaningful and credible representation of what they purport to measure, is therefore of utmost importance. However, a grade cannot be valid if one cannot trust that it will consistently result in the same value, regardless of who assigns it or when. Unfortunately, such reliability becomes increasingly difficult to achieve as class sizes grow, especially when multiple evaluators are involved, as is often the case in mandatory introductory courses at large universities. Reliability suffers further when evaluating open-ended tasks, which are prevalent in authentic, high-quality engineering coursework.

This study explores grading reliability in the context of a large, multi-section engineering course. Recognizing the number of people involved and the plethora of activities that affect grading outcomes, the study adopts a systems approach to conduct a human reliability analysis using the Functional Resonance Analysis Method. Through this method, a collection of data sources, including course materials and observational interviews with undergraduate teaching assistant graders, is synthesized to produce a general model of how actions vary and affect subsequent actions within the system under study. Using a course assignment and student responses, the model shows how differences in contextual variables affect expected actions within the system. Next, the model is applied to each of the observational interviews with undergraduate teaching assistants to demonstrate how these actions occur in practice and to compare graders with one another and with expected behaviors. These results are then related to the agreement in system outcomes, or grades, assigned by each grader to guide analysis of how actions within the system affect its outcome.

The results of this study connect and elaborate upon previous models of grader cognition by analyzing the phenomenon in engineering, a previously unexplored context. The model presented can be readily generalized and adapted to smaller systems with fewer actors to identify sources of variability and potential threats to outcome reliability. The analysis of observed outcome instantiations informs a set of recommendations for minimizing grading variability.