Riniker Sereina, Wang Yuan, Jenkins Jeremy L, Landrum Gregory A
Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4056 Basel, Switzerland.
J Chem Inf Model. 2014 Jul 28;54(7):1880-91. doi: 10.1021/ci500190p. Epub 2014 Jun 26.
Modern high-throughput screening (HTS) is a well-established approach for hit finding in drug discovery that is routinely employed in the pharmaceutical industry to screen more than a million compounds within a few weeks. However, as the industry shifts to more disease-relevant but more complex phenotypic screens, the focus has moved to piloting smaller but smarter chemically/biologically diverse subsets followed by an expansion around hit compounds. One standard method for doing this is to train a machine-learning (ML) model with the chemical fingerprints of the tested subset of molecules and then select the next compounds based on the predictions of this model. An alternative approach would be to take advantage of the wealth of bioactivity information contained in older (full-deck) screens by using so-called HTS fingerprints, where each element of the fingerprint corresponds to the outcome of a particular assay, as input to machine-learning algorithms. We constructed HTS fingerprints using two collections of data: 93 in-house assays and 95 publicly available assays from PubChem. For each source, an additional set of 51 and 46 assays, respectively, was collected for testing. Three different ML methods, random forest (RF), logistic regression (LR), and naïve Bayes (NB), were investigated for both the HTS fingerprint and a chemical fingerprint, Morgan2. RF was found to be best suited for learning from HTS fingerprints, yielding area under the receiver operating characteristic curve (AUC) values >0.8 for 78% of the internal assays and enrichment factors at 5% (EF(5%)) >10 for 55% of the assays. For the majority of assays, the RF(HTS-fp) model outperformed LR trained with Morgan2, which was the best ML method for the chemical fingerprint. In addition, HTS fingerprints were found to retrieve more diverse chemotypes. Combining the two models through heterogeneous classifier fusion led to a similar or better performance than the best individual model for all assays.
Further validation using a pair of in-house assays and data from a confirmatory screen (including a prospective set of around 2000 compounds selected based on our approach) confirmed the good performance. Thus, combining machine learning with HTS fingerprints and chemical fingerprints uses information from both domains and presents a very promising approach for hit expansion, leading to more hits. The source code used with the public data is provided.
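The abstract reports enrichment factors and a fusion of two classifiers but does not spell out the formulas. The following is a minimal sketch, not the authors' code. It assumes the common definition of the enrichment factor (hit rate among the top-ranked fraction divided by the hit rate in the whole set) and one simple rank-based fusion rule (each compound keeps the better of its two ranks, with ties broken by the higher score). All function names and the toy scores are hypothetical and chosen only for illustration.

```python
def order_by_score(scores):
    """Return compound indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def enrichment_factor(order, labels, fraction=0.05):
    """EF at a given fraction: hit rate in the top slice / overall hit rate.

    `order` is a ranked list of compound indices (best first);
    `labels` holds 1 for active and 0 for inactive compounds.
    """
    n = len(labels)
    total_hits = sum(labels)
    if n == 0 or total_hits == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, int(round(n * fraction)))
    hits_top = sum(labels[i] for i in order[:n_top])
    return (hits_top / n_top) / (total_hits / n)


def ranks_from_scores(scores):
    """Rank 1 is the highest-scoring compound."""
    ranks = [0] * len(scores)
    for rank, i in enumerate(order_by_score(scores), start=1):
        ranks[i] = rank
    return ranks


def fuse_by_best_rank(scores_a, scores_b):
    """Assumed fusion rule: sort by the better (smaller) of the two ranks,
    breaking ties with the higher of the two scores."""
    ra, rb = ranks_from_scores(scores_a), ranks_from_scores(scores_b)
    return sorted(
        range(len(scores_a)),
        key=lambda i: (min(ra[i], rb[i]), -max(scores_a[i], scores_b[i])),
    )


# Toy data (hypothetical): 10 compounds, the first two are active.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Stand-in for an HTS-fingerprint model: finds active 0, misses active 1.
scores_hts = [0.9, 0.05, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
# Stand-in for a chemical-fingerprint model: finds active 1, misses active 0.
scores_chem = [0.05, 0.95, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

ef_hts = enrichment_factor(order_by_score(scores_hts), labels, fraction=0.2)
ef_chem = enrichment_factor(order_by_score(scores_chem), labels, fraction=0.2)
fused_order = fuse_by_best_rank(scores_hts, scores_chem)
ef_fused = enrichment_factor(fused_order, labels, fraction=0.2)
```

In this toy case each model ranks a different active first, so the fused ranking places both actives at the top and its enrichment factor exceeds that of either model alone. Real screening data will not be this clean; the example only illustrates why fusing models trained on complementary fingerprints can help.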