Center for Bioinformatics (ZBH), Department of Computer Science, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, Hamburg, 20146, Germany.
CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Laboratory of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, 166 28 Prague 6, Czech Republic.
J Chem Inf Model. 2019 Mar 25;59(3):1030-1043. doi: 10.1021/acs.jcim.8b00677. Epub 2019 Jan 25.
Assay interference caused by small molecules continues to pose a significant challenge for early drug discovery. A number of rule-based and similarity-based approaches have been derived that allow the flagging of potentially "badly behaving compounds", "bad actors", or "nuisance compounds". These compounds are typically aggregators, reactive compounds, and/or pan-assay interference compounds (PAINS), and many of them are frequent hitters. Hit Dexter is a recently introduced machine learning approach that predicts frequent hitters independent of the underlying physicochemical mechanisms (including also the binding of compounds based on "privileged scaffolds" to multiple binding sites). Here we report on the development of a second generation of machine learning models which now covers both primary screening assays and confirmatory dose-response assays. Protein sequence clustering was newly introduced to minimize the overrepresentation of structurally and functionally related proteins. The models correctly classified compounds of large independent test sets as (highly) promiscuous or nonpromiscuous with Matthews correlation coefficient (MCC) values of up to 0.64 and area under the receiver operating characteristic curve (AUC) values of up to 0.96. The models were also utilized to characterize sets of compounds with specific biological and physicochemical properties, such as dark chemical matter, aggregators, compounds from a high-throughput screening library, drug-like compounds, approved drugs, potential PAINS, and natural products. Among the most interesting outcomes is that the new Hit Dexter models predict the presence of large fractions of (highly) promiscuous compounds among approved drugs. Importantly, predictions of the individual Hit Dexter models are generally in good agreement and consistent with those of Badapple, an established statistical model for the prediction of frequent hitters. 
The new Hit Dexter 2.0 web service, available at http://hitdexter2.zbh.uni-hamburg.de, provides user-friendly access not only to all machine learning models presented in this work but also to similarity-based methods for the prediction of aggregators and dark chemical matter, as well as a comprehensive collection of available rule sets for flagging frequent hitters and compounds that contain undesired substructures.
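For readers less familiar with the two performance measures quoted above, the following minimal sketch shows how MCC and AUC can be computed for a binary promiscuous/nonpromiscuous classifier. It is an illustration only and not the authors' evaluation code: the function names `matthews_corrcoef` and `roc_auc`, the confusion-matrix counts, and the scores are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts.

    Returns 0.0 when any marginal sum is zero (the usual convention,
    since the coefficient is undefined in that case).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, chosen only for illustration.
mcc = matthews_corrcoef(tp=80, tn=900, fp=60, fn=40)
auc = roc_auc(pos_scores=[0.9, 0.7, 0.6], neg_scores=[0.8, 0.3, 0.2, 0.1])
print(f"MCC = {mcc:.2f}, AUC = {auc:.2f}")
```

MCC ranges from -1 to 1 and remains informative when the classes are imbalanced, as is typically the case when promiscuous compounds are a small minority. AUC ranges from 0 to 1 and measures how well the model ranks compounds irrespective of any decision threshold.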