Purdue University Graduate School
CHEMICAL SPACE INVADERS: ENHANCING EXPLORATION OF MODULARLY CONSTRUCTED CHEMICAL SPACES USING CONTEXT AWARE AI AGENTS

thesis
posted on 2024-10-10, 15:31 authored by Matthew Muhoberac

Chemical science can be imagined as a universe of information in which the individual galaxies, solar systems, stars, and planets are compounds, reactions, biomolecules, and so on, all of which need to be discovered, researched, and documented. The problem is that the universe of chemical science is potentially vaster than the one in which we live, and we are exploring it in a relatively inefficient manner. There is a scene in one of my favorite television shows, Futurama, which paints a picture of traditional chemical exploration. In the show, which takes place in the 30th century, the main character Fry loses his robot friend Bender in outer space and resorts to using a giant telescope in the Himalayan mountains to randomly search through points in space to try to find him. After days of searching nonstop, he gives up, noting that it is an impossible task because space is so vast and he is searching so inefficiently. While human exploration of chemistry may not be as inefficient, many steps are driven by trial and error and educated guesswork, which ultimately introduce major inefficiencies into scientific discovery. While we don’t live in the 30th century yet, we do have access to 21st-century technology which can assist in exploring chemistry in a more directed manner. This mainly involves using machine learning, search algorithms, and exploratory AI powered by generative models as a force multiplier to assist human chemists in chemical exploration. To shamelessly compare this with another space-based sci-fi reference, this would be akin to deploying hundreds or thousands of automated space probes to search unexplored planets, much as the Empire found the Rebellion on Hoth in The Empire Strikes Back.

The journey to integrate AI with chemical exploration starts with the important concept of standardization and how to apply it to chemically relevant data. To easily organize, store, and access relevant aspects of small molecules, macromolecules, chemical reactions, biological assays, etc., it is imperative that data be represented in a standard format which accurately portrays the necessary chemical information. This becomes especially relevant as humans aggregate more and more chemical data. Chapter 2 of this thesis tackles a subset of standardization involving benchmarking sets for comparative evaluation of docking software. One major reason why standardization is so important is that it promotes ease of access to relevant data, regardless of whether this access is attempted by human or computational means. While improving data access for humans is beneficial, computationally it is a game changer when mining training data for machine learning (ML) applications. Having standardized data readily available for computational access allows software to rapidly access and preprocess relevant data, which boosts efficiency in ML model training. In Chapter 4 of this thesis, the central database of the CIPHER closed-loop system is standardized and integrated with a REST API, allowing for rapid data acquisition via a structured URL call. Database standardization, together with a mechanism for easy data mining, makes a database “ML ready” and promotes its use in ML applications.
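As a rough illustration of what "data acquisition via a structured URL call" can look like, the sketch below assembles a query URL from a resource name and filter parameters. The host name, the `/api/` path layout, and the parameter names are all hypothetical and are not taken from the CIPHER system; only the general pattern is intended.

```python
from urllib.parse import urlencode, urlunsplit


def build_query_url(host, resource, **filters):
    """Assemble a structured URL for a hypothetical REST endpoint.

    The path layout and filter names are illustrative assumptions.
    """
    # Sort the filters so the same request always yields the same URL.
    query = urlencode(sorted(filters.items()))
    return urlunsplit(("https", host, f"/api/{resource}", query, ""))


# Hypothetical request: up to 50 compound records from a binding assay.
url = build_query_url(
    "cipher.example.org", "compounds", assay="binding", limit=50
)
print(url)
```

Because every record is reachable through a predictable URL pattern, a training script could loop over resources and filters without any manual export step.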

Building upon data standardization and the training of ML models for chemical applications, the next step of this journey revolves around a concept known as a “chemical space” and how chemists can approximate and sample chemical spaces in a directed manner. In the context of this thesis, a chemical space can be visualized in the following manner. Start by envisioning any chemical relationship between some inputs and outputs as an unknown mathematical function. For example, if one is measuring the assay response of a specific drug at a certain concentration, the input would be the concentration, and the output would be the assay response. The bounds of this space are then set by determining the range of input values, and this forms a chemical space which corresponds to the chemical problem. Chemists sample these spaces every day when they go into the lab, run experiments, and analyze their data. While the example described above is relatively simple in scope, even if the relationship is very complex, techniques such as ML can be used to approximate it. An example of this approximation is shown in Chapter 3 of this thesis, where a normalizing flow architecture is used to bias a vector space representation of molecules with chemical properties, creating a space which correlates compound and property and can be sampled to provide compounds with specific values of the trained chemical properties. Training individual models is important, but to truly emulate certain chemical processes, multiple models may need to be combined with physical instrumentation to efficiently sample and validate a chemical space. Chapter 4 of this thesis expands upon this concept by integrating a variety of ML modules with high-throughput (HT) bioassay instrumentation to create a “closed-loop” system designed around discovering, synthesizing, and validating non-addictive analgesics.
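The concentration and assay-response example above can be sketched in a few lines. In this toy version the "unknown" relationship is a made-up Hill-type curve, the bounds are an arbitrary concentration range, and a piecewise-linear interpolant stands in for the ML model; none of these choices come from the thesis, and all function names are hypothetical.

```python
from bisect import bisect_left


def hidden_response(conc):
    """Stand-in for the unknown relationship (a made-up Hill-type curve)."""
    return conc / (conc + 1.0)


def sample_space(lower, upper, n_points):
    """Run 'experiments' at evenly spaced inputs within the bounds."""
    step = (upper - lower) / (n_points - 1)
    xs = [lower + i * step for i in range(n_points)]
    return xs, [hidden_response(x) for x in xs]


def make_surrogate(xs, ys):
    """Return a piecewise-linear approximation built from the samples."""
    def predict(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("input lies outside the sampled chemical space")
        # Index of the right-hand neighbour of x among the sampled inputs.
        i = max(1, bisect_left(xs, x))
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return predict


# Bound the space to concentrations 0..10 and sample it at 11 points.
inputs, outputs = sample_space(0.0, 10.0, 11)
predict = make_surrogate(inputs, outputs)

# Query the surrogate at a concentration that was never measured.
print(predict(2.5), hidden_response(2.5))
```

The surrogate refuses to answer outside the sampled bounds, which mirrors the idea that the chemical space is defined by the range of inputs that were actually explored.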

The final step of this journey is to integrate these systems which sample chemical spaces with AI, allowing for automated exploration of these spaces in a directed manner. There are several AI frameworks which can be used separately or in combination to accomplish this task, but the framework that is the focus of this thesis is AI agents. AI agents are entities which use some form of AI as a logical processing center that drives their exploration through a problem space. This can be a simple algorithm, some type of heuristic model, or an advanced form of generative AI such as an LLM. Additionally, these agents generally have access to certain tools which serve as a medium for interaction with physical or computational environments, such as controlling a robotic arm or searching a database. Finally, these agents generally have a notion of past actions and observations, commonly referred to as memory, which allows them to recall important information as they explore. Chapter 5 of this thesis details a custom agentic framework which is tailored towards complex scientific applications. This framework builds agents from source documentation around a specific user-defined scope, provides them with access to literature and documentation in the form of embeddings, has custom memory for highly targeted retention, and allows agents to communicate with one another to promote collaborative problem solving. Chapter 6 of this thesis showcases an application of a simpler agentic framework to an automated lipidomic workflow which performs comparative analysis on 5xFAD vs. WT mouse brain tissue. The group of AI agents involved in this system generates mass spectrometry worklists, filters data into categories for analysis, performs comparative analysis, and allows the user to dynamically create plots which can be used to answer specific statistical questions. In addition to performing all these operational and statistical analysis functions, the system includes an agent which uses document embeddings built from curated technical manuals and protocols to answer user questions via a chatbot-style interface. Overall, the system showcases how AI can effectively be applied to relevant chemical problems to enhance speed, bolster accuracy, and improve usability.
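The three ingredients of an agent described above (a decision-making core, tools, and memory) can be sketched minimally as follows. This is not the framework from Chapter 5 or Chapter 6; the class, the tool names, the toy database, and the rule-based policy standing in for an LLM are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Toy lookup table standing in for a real chemical database (hypothetical).
TOY_DB = {"aspirin": "C9H8O4"}


def search_database(task: str) -> str:
    """Tool: look the task string up in the toy database."""
    return TOY_DB.get(task, "not found")


def report(task: str) -> str:
    """Tool: emit a short completion message."""
    return f"{task} done"


def simple_policy(task: str, memory: List[Tuple[str, str]]) -> str:
    """Decision core: search first, then report once something is remembered."""
    return "search_database" if not memory else "report"


@dataclass
class Agent:
    policy: Callable[[str, List[Tuple[str, str]]], str]
    tools: Dict[str, Callable[[str], str]]
    memory: List[Tuple[str, str]] = field(default_factory=list)

    def step(self, task: str) -> str:
        # Choose a tool using the task and the record of past steps.
        tool_name = self.policy(task, self.memory)
        observation = self.tools[tool_name](task)
        # Store the action and its result so later decisions can use them.
        self.memory.append((tool_name, observation))
        return observation


agent = Agent(simple_policy, {"search_database": search_database, "report": report})
first = agent.step("aspirin")
second = agent.step("aspirin")
print(first, second, agent.memory)
```

Swapping `simple_policy` for a call to a heuristic model or an LLM would change how the tool is chosen without changing the surrounding loop, which is the separation the paragraph describes.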

Funding

Development of A Specialized Platform for Innovative Research Exploration (ASPIRE)

National Center for Advancing Translational Sciences

Chemical instruments-aware distributed blockchain based open AI platform to accelerate drug discovery

National Center for Advancing Translational Sciences

Cancer Center Support Grant

National Cancer Institute

Degree Type

  • Doctor of Philosophy

Department

  • Chemistry

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Gaurav Chopra

Additional Committee Member 2

Herman O Sintim

Additional Committee Member 3

Ananth Y Grama

Additional Committee Member 4

Ming Chen