Abstract
Human beings are exposed to many chemicals through routine interactions with the environment, such as food and drug consumption, household or workplace activities, industrial or transportation activities, and even common environmental processes. Once absorbed, these chemicals are usually transformed biologically into metabolites. Hence it is important to understand and predict how these exogenous chemicals are metabolized in the body. We decompose this in silico metabolism prediction task into three subtasks: given a compound m and a specific metabolizing enzyme α, (1) predicting whether m is a substrate of α, (2) if so, predicting what part of m is changed (here, the “bond of metabolism”), and (3) predicting the resulting terminal metabolite. This dissertation addresses the first two of these subtasks for the nine most important human cytochrome P450 (CYP450) enzymes: CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4. (1) Given an arbitrary molecule m and one of these nine CYP450 enzymes α, CypReact predicts whether m will react with α. On a dataset of 1632 molecules, CypReact’s cross-validation AUROCs (areas under the receiver operating characteristic curve) ranged from 0.83 to 0.92. (2) Given one of the nine enzymes α and one of its substrates m, CypBoMη−η predicts where m is metabolized by α, that is, which of its η-η bonds (each a bond between two non-hydrogen atoms) is a “bond of metabolism”. On a dataset of 679 compounds, CypBoMη−η’s cross-validation Jaccard scores ranged from 0.401 to 0.594. Our empirical studies, on datasets disjoint from our training sets, demonstrated that CypReact and CypBoMη−η performed significantly better than related tools (e.g., ADMET Predictor and Meteor Nexus) on several evaluation metrics, including Jaccard score and MCC (Matthews correlation coefficient).
As both tools are freely available, we anticipate many future researchers and developers will use them to better understand human metabolism.
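To make the evaluation metrics named above concrete, the following minimal sketch computes a Jaccard score and an MCC from their standard definitions. It is an illustration only, not code from CypReact or CypBoMη−η: the function names, the representation of bonds as pairs of atom indices, and all numbers in the example are hypothetical, and treating the Jaccard score as a comparison between a predicted and a true set of bonds for one molecule is an assumption about how the metric is applied.

```python
from math import sqrt


def jaccard_score(predicted: set, actual: set) -> float:
    """Jaccard score |P ∩ A| / |P ∪ A| between two sets.

    Returns 1.0 when both sets are empty (a convention chosen here).
    """
    union = predicted | actual
    if not union:
        return 1.0
    return len(predicted & actual) / len(union)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical example: bonds written as (atom_index, atom_index) pairs.
predicted_bonds = {(1, 2), (2, 3), (5, 6)}
true_bonds = {(2, 3), (5, 6), (7, 8)}
bond_jaccard = jaccard_score(predicted_bonds, true_bonds)  # 2 shared / 4 total

# Hypothetical confusion-matrix counts for a substrate/non-substrate classifier.
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)
```

Here a Jaccard score of 1 means the predicted and true bond sets coincide, while an MCC of 1 means perfect classification and 0 means performance no better than chance.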