Purdue University Graduate School
Qiyuan_PhD_dissertation.pdf (21.32 MB)


posted on 2023-04-21, 15:17 authored by Qiyuan Zhao

Automated reaction prediction has the potential to elucidate complex reaction networks for many applications in chemical engineering, including materials degradation, drug design, combustion chemistry, and biomass conversion. Unlike traditional approaches to mechanism elucidation, which rely on manual setup of quantum chemistry calculations, automated reaction prediction avoids tedious trial and error and greatly reduces the risk of overlooking important reactions. Despite these advantages, the potential of automated reaction prediction as a general-purpose tool remains largely unrealized because of high computational cost and inconsistent reaction coverage. This dissertation therefore develops methods that simultaneously reduce the computational cost and increase the reaction coverage. Specifically, the computational cost is reduced through more efficient transition state (TS) localization workflows and fast packages for predicting molecular and reaction properties, while the reaction coverage is increased through comprehensive reaction space exploration based on mathematically defined elementary reaction steps. These components are implemented in two open-source packages: TAFFI (Topology Automated Force-Field Interactions) component increment theory (TCIT) and Yet Another Reaction Program (YARP).

The first package, TCIT, is the first molecular property prediction package based on component increment theory. TCIT rests on the locality assumption, under which a molecular thermochemical property is decomposed into a sum of contributions from individual subgraphs. In contrast to traditional "group" increment theory, TCIT defines each subgraph (component) as a central atom plus its nearest and next-nearest neighboring atoms, and consistently parameterizes the contribution of each component purely from quantum chemistry calculations. Although all parameterizations are based on quantum chemical calculations, when benchmarked against experimental data TCIT provides more accurate predictions than traditional methods that use the same experimental dataset for parameterization. With TCIT, molecular properties (e.g., enthalpy of formation) and reaction properties (e.g., enthalpy of reaction) can be accurately predicted on the fly. The second package, YARP, is developed for automated reaction space exploration and deep reaction network prediction. By optimizing the reaction enumeration, geometry initialization, and transition state convergence algorithms that are common to many prediction methodologies, YARP (re)discovers both established and unreported reaction pathways and products while reducing the cost of reaction characterization nearly 100-fold and increasing the convergence of transition states, compared with recent benchmarks. An updated version, YARP v2.0, extends this cost reduction from roughly 100-fold to 300-fold while increasing the reaction coverage beyond the scope of elementary reaction steps. This combination of ultra-low cost and high reaction coverage creates opportunities to explore the reactivity of larger systems and more complex reaction networks for applications such as chemical degradation, where computational cost is a bottleneck.
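The locality assumption described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the hydrogen-suppressed adjacency-list representation, the `component_key` labeling (which ignores bond orders and the detailed connectivity among neighbors), and the numeric increments, which are placeholders with no physical meaning. It is not TCIT's actual data model or parameter set.

```python
from collections import Counter

def component_key(atoms, bonds, center):
    """Label a component by its central atom, nearest neighbors,
    and next-nearest neighbors (hypothetical, simplified labeling)."""
    first = sorted(bonds[center])
    second = {j for i in first for j in bonds[i]} - set(first) - {center}
    return (
        atoms[center],
        tuple(sorted(atoms[i] for i in first)),
        tuple(sorted(atoms[j] for j in second)),
    )

def predict_property(atoms, bonds, increments):
    """Sum the increment of every component found in the molecule."""
    counts = Counter(component_key(atoms, bonds, c) for c in range(len(atoms)))
    return sum(increments[key] * n for key, n in counts.items())

def reaction_property(reactants, products, increments):
    """Reaction property = sum over products minus sum over reactants."""
    total = lambda mols: sum(predict_property(a, b, increments) for a, b in mols)
    return total(products) - total(reactants)

# Toy example: a three-carbon chain (hydrogens suppressed).
atoms = ["C", "C", "C"]
bonds = {0: [1], 1: [0, 2], 2: [1]}

# Placeholder increments, NOT fitted values.
increments = {
    ("C", ("C",), ("C",)): -10.0,   # terminal carbon
    ("C", ("C", "C"), ()): -5.0,    # central carbon
}

estimate = predict_property(atoms, bonds, increments)
```

In this toy case the two terminal carbons share one component label and the central carbon has another, so the estimate is the sum of three increments. Because the prediction is a table lookup plus a sum, it can be evaluated on the fly once the increments are parameterized.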

The power of TCIT and YARP has been demonstrated across a broad range of applications. In the first application, YARP was used to explore the reactivity of unimolecular and bimolecular reactants, comprising a total of 581 reactions involving 51 distinct reactants. The algorithm discovered all established reaction pathways, where such comparisons are possible, while also revealing a much richer reactivity landscape, including lower-barrier reaction pathways and a strong dependence of the apparent barriers of the reported reactions on reaction conformation. Second, YARP was applied to the search for prebiotic chemical pathways, a long-standing puzzle that has generated a menagerie of competing hypotheses with limited experimental prospects for falsification. With YARP, the space of organic molecules that can be formed within four polar or pericyclic reactions from water and hydrogen cyanide (HCN) was comprehensively explored. A surprisingly diverse reactivity landscape was revealed within just a few steps of these simple molecules, and reaction pathways to several biologically relevant molecules were discovered that involve lower activation energies and fewer reaction steps than recently proposed alternatives. In the third application, predicting the reaction network of glucose pyrolysis, YARP generated by far the largest and most complex reaction network in the domain of biomass pyrolysis and discovered many unexpected reaction mechanisms.

Further, motivated by the fact that existing reaction transition state (TS) databases are comparatively small and lack chemical diversity, YARP, together with the concept of a graphically defined model reaction, was used to address this data gap by comprehensively characterizing a reaction space associated with C-, H-, O-, and N-containing molecules with up to 10 heavy (non-hydrogen) atoms. The resulting dataset, the Reaction Graph Depth 1 (RGD1) dataset, is composed of 176,992 organic reactions, each possessing at least one validated TS, activation energy, enthalpy of reaction, reactant and product geometries, frequencies, and atom-mapping. The RGD1 dataset represents the largest and most chemically diverse TS dataset published to date and should find immediate use in developing novel machine learning models for predicting reaction properties. In addition to exploring the molecular reaction space, YARP was also extended to explore and characterize reaction networks in heterogeneous catalysis systems. With ethylene oligomerization on silica-supported single-site Ga catalysts as a model system, YARP illustrates how a comprehensive reaction network can be generated using only graph-based rules for exploring the network and elementary constraints based on activation energy and system size for identifying network terminations. The automated reaction exploration (re)discovered the Ga-alkyl-centered Cossee-Arlman mechanism that is hypothesized to drive major product formation, while also predicting several new pathways for producing alkanes and coke precursors. The diverse scope of these applications and the milestone quality of many of the reaction networks produced by YARP illustrate that automated reaction prediction is approaching a general-purpose capability.
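The idea of growing a reaction network step by step and terminating branches by activation energy, system size, and number of reaction steps can be sketched as a breadth-first search. This is an illustrative sketch only, not YARP's implementation: `explore_network`, the `enumerate_reactions` callback (standing in for graph-based reaction enumeration plus TS characterization), `size_of`, and the toy network with its barrier values are all hypothetical.

```python
from collections import deque

def explore_network(start, enumerate_reactions, size_of,
                    max_barrier, max_size, max_depth):
    """Breadth-first exploration of a reaction network.

    enumerate_reactions(species) -> iterable of (product, barrier) pairs
    (hypothetical callback). A reaction is kept only if its barrier and
    product size pass the termination constraints; species reached at
    max_depth are not expanded further.
    """
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        species, depth = queue.popleft()
        if depth >= max_depth:
            continue  # step limit reached; do not expand this node
        for product, barrier in enumerate_reactions(species):
            if barrier > max_barrier or size_of(product) > max_size:
                continue  # terminate this branch
            edges.append((species, product, barrier))
            if product not in seen:
                seen.add(product)
                queue.append((product, depth + 1))
    return seen, edges

# Toy network with made-up species labels and barriers (arbitrary units).
TOY = {
    "A": [("B", 20.0), ("C", 80.0)],
    "B": [("D", 30.0), ("EEEE", 10.0)],
    "D": [("F", 5.0)],
}

species, reactions = explore_network(
    "A",
    enumerate_reactions=lambda s: TOY.get(s, []),
    size_of=len,          # stand-in for a heavy-atom count
    max_barrier=50.0,
    max_size=3,
    max_depth=2,
)
```

In the toy run, the high-barrier step to "C" and the oversized product "EEEE" are pruned, and "D" is reached at the step limit so it is not expanded further. A depth limit of this kind corresponds to restricting the search to a fixed number of reaction steps from the starting molecules.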


Purdue Process Safety and Assurance Center

Office of Naval Research


Degree Type

  • Doctor of Philosophy


  • Chemical Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Brett M. Savoie

Additional Committee Member 2

Jeffrey P. Greeley

Additional Committee Member 3

David S. Corti

Additional Committee Member 4

Gaurav Chopra
