Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States.
Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, North Carolina 27599, United States.
Mol Pharm. 2018 Oct 1;15(10):4346-4360. doi: 10.1021/acs.molpharmaceut.8b00083. Epub 2018 Apr 26.
Tuberculosis is a global health dilemma. In 2016, the WHO reported 10.4 million cases and 1.7 million deaths. The need to develop new treatments for those infected with Mycobacterium tuberculosis (Mtb) has led to many large-scale phenotypic screens and to the identification of many thousands of new compounds active in vitro. However, with limited funding, efforts to discover new active molecules against Mtb need to be more efficient. Several computational machine learning approaches have been shown to have good enrichment and hit rates. We have curated small molecule Mtb data and developed new models with a total of 18,886 molecules with activity cutoffs of 10 μM, 1 μM, and 100 nM. These data sets were used to evaluate different machine learning methods (including deep learning) and metrics and to generate predictions for additional molecules published in 2017. One Mtb model, a Bayesian model built on combined in vitro and in vivo data at a 100 nM activity cutoff, yielded the following metrics for 5-fold cross validation: accuracy = 0.88, precision = 0.22, recall = 0.91, specificity = 0.88, kappa = 0.31, and MCC = 0.41. We have also curated an evaluation set (n = 153 compounds) published in 2017, and when used to test our model, it showed comparable statistics (accuracy = 0.83, precision = 0.27, recall = 1.00, specificity = 0.81, kappa = 0.36, and MCC = 0.47). We have also compared these models with additional machine learning algorithms, showing that Bayesian machine learning models constructed with literature Mtb data generated by different laboratories were generally equivalent to or outperformed deep neural networks on external test sets. Finally, we have also compared our training and test sets to show that they were suitably diverse and different from each other, and therefore represent useful evaluation sets. Such Mtb machine learning models could help prioritize compounds for testing in vitro and in vivo.
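The six statistics reported above (accuracy, precision, recall, specificity, kappa, and MCC) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' code. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper; treating an undefined ratio as 0.0 is likewise an assumption made here for simplicity.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    if n == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(num: float, den: float) -> float:
        # Assumption for this sketch: an undefined ratio (zero denominator)
        # is reported as 0.0 rather than raising an error.
        return num / den if den else 0.0

    accuracy = (tp + tn) / n
    precision = ratio(tp, tp + fp)      # fraction of predicted actives that are active
    recall = ratio(tp, tp + fn)         # fraction of true actives that are found
    specificity = ratio(tn, tn + fp)    # fraction of true inactives that are rejected

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = ratio(accuracy - expected, 1.0 - expected)

    # Matthews correlation coefficient.
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "kappa": kappa,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced set (10 actives among 200 compounds);
    # these are illustrative only and are NOT the paper's confusion matrix.
    metrics = classification_metrics(tp=9, fp=30, tn=160, fn=1)
    for name, value in metrics.items():
        print(f"{name:12s} {value:.2f}")
```

With these hypothetical counts, precision comes out far below recall even though accuracy and specificity are high. This is the pattern expected when actives are rare, and it is why chance-corrected measures such as kappa and MCC are reported alongside accuracy.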