Abstract
A hierarchical clustering algorithm, NIPALSTREE, was developed that can analyze large data sets in high-dimensional space. The result can be displayed as a dendrogram. At each tree level, the algorithm projects a data set onto one dimension via principal component analysis. The data set is sorted along this dimension and split at the median position. To avoid distorting clusters at the median position, the algorithm looks for a potentially more suitable split point to the left or right of the median. The procedure is applied recursively to the resulting subsets as long as the maximal distance between cluster members exceeds a user-defined threshold. The approach was validated in a retrospective screening study for angiotensin-converting enzyme (ACE) inhibitors. The resulting clusters were assessed for their purity and enrichment in actives belonging to this ligand class. Enrichment was observed in individual branches of the dendrogram. In further retrospective virtual screening studies employing the MDL Drug Data Report (MDDR), COBRA, and the SPECS catalog, NIPALSTREE was compared with the hierarchical k-means clustering approach. The results show that both algorithms can be used in the context of virtual screening. Intersecting the result lists obtained with both algorithms improved enrichment factors while losing only a few chemotypes.
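The recursive procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the first principal component is computed or how the alternative split point is chosen, so the sketch assumes a NIPALS iteration (suggested only by the algorithm's name) and assumes that the split moves to the largest gap between neighbouring scores within a small window around the median. All function names and the parameters `max_dist` and `window` are hypothetical.

```python
import numpy as np

def first_pc_scores(X, n_iter=100, tol=1e-9):
    """Scores of the rows of X on the first principal component (NIPALS iteration)."""
    Xc = X - X.mean(axis=0)
    t = Xc[:, np.argmax(Xc.var(axis=0))].copy()  # start from the highest-variance column
    for _ in range(n_iter):
        tt = t @ t
        if tt == 0.0:                # all points identical: nothing to project
            return t
        p = Xc.T @ t / tt            # loading vector
        p /= np.linalg.norm(p)
        t_new = Xc @ p               # updated scores
        converged = np.linalg.norm(t_new - t) < tol
        t = t_new
        if converged:
            break
    return t

def max_pairwise_distance(X):
    # Brute force, O(n^2) memory: acceptable for a sketch, not for large libraries.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).max()

def choose_split(sorted_scores, window):
    """Assumed criterion: largest gap between neighbours near the median."""
    n = len(sorted_scores)
    mid = n // 2
    lo, hi = max(1, mid - window), min(n - 1, mid + window)
    gaps = sorted_scores[lo:hi + 1] - sorted_scores[lo - 1:hi]
    return lo + int(np.argmax(gaps))  # left part is [:s], right part is [s:]

def nipalstree_sketch(X, idx=None, max_dist=1.0, window=2):
    """Recursively bisect X; returns a nested dict that can be read as a dendrogram."""
    if idx is None:
        idx = np.arange(len(X))
    node = {"members": idx.tolist()}
    if len(idx) < 2 or max_pairwise_distance(X[idx]) <= max_dist:
        return node                  # cluster is tight enough: leaf
    scores = first_pc_scores(X[idx])
    order = np.argsort(scores)
    s = choose_split(scores[order], window)
    node["children"] = [
        nipalstree_sketch(X, idx[order[:s]], max_dist, window),
        nipalstree_sketch(X, idx[order[s:]], max_dist, window),
    ]
    return node

def leaves(node):
    """Collect the member lists of all leaf clusters."""
    if "children" not in node:
        return [node["members"]]
    return [m for child in node["children"] for m in leaves(child)]

# Two groups of unequal size: a plain median split would cut the larger group.
X = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1]], dtype=float)
tree = nipalstree_sketch(X, max_dist=1.0, window=2)
print(sorted(sorted(leaf) for leaf in leaves(tree)))  # [[0, 1], [2, 3, 4, 5]]
```

In this toy example the median lies inside the larger group, and the gap-based shift moves the split to the boundary between the two groups, which is the kind of distortion the adjusted split point is meant to avoid.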