Fang Xingang, Bagui Sikha, Bagui Subhash
Department of Computer Science, University of West Florida, Pensacola, FL 32514, United States.
Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, United States.
Comput Biol Chem. 2017 Aug;69:110-119. doi: 10.1016/j.compbiolchem.2017.05.007. Epub 2017 May 29.
The readily available high-throughput screening (HTS) data in the PubChem database provide an opportunity to mine small molecules across a variety of biological systems using machine learning techniques. With thousands of molecular descriptors available to encode the chemical characteristics of molecules, descriptor selection is an essential step in building an optimal quantitative structure-activity relationship (QSAR) model. Developing a systematic descriptor selection strategy requires understanding the relationship between (i) the descriptor selection, (ii) the choice of machine learning model, and (iii) the characteristics of the target biomolecule. In this work, we employed the Signature descriptor to generate a dataset from the human kallikrein 5 (hK 5) inhibition confirmatory assay data and compared multiple classification models, including logistic regression, support vector machine, random forest, and k-nearest neighbor. Under optimal conditions, the logistic regression model achieved very high overall accuracy (98%) and precision (90%), with good sensitivity (65%), in the cross-validation test. When tested on the primary HTS data, comprising more than 200K molecular structures, the logistic regression model was able to eliminate more than 99.9% of the inactive structures. As part of our exploration of the descriptor-model-target relationship, the excellent predictive performance of the Signature descriptor combined with logistic regression on the hK 5 assay data suggests a feasible descriptor/model selection strategy for similar targets.
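The three figures quoted above (accuracy, precision, sensitivity) are all derived from the same confusion matrix. The minimal sketch below shows how they relate on an imbalanced screening set. The function name `classification_metrics` and the counts in the usage example are hypothetical, chosen only to land near the reported percentages. They are not the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: active compounds predicted active/inactive.
    tn/fp: inactive compounds predicted inactive/active.
    """
    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        # Fraction of all compounds classified correctly.
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        # Fraction of predicted actives that are truly active.
        "precision": ratio(tp, tp + fp),
        # Fraction of true actives that were recovered (recall).
        "sensitivity": ratio(tp, tp + fn),
        # Fraction of true inactives that were correctly eliminated.
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 100 actives, 1900 inactives.
    metrics = classification_metrics(tp=65, fp=7, tn=1893, fn=35)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Because inactive compounds dominate such a set, accuracy is driven mostly by the true negatives. This is why a model can reach very high accuracy while still missing a substantial share of the actives, and why precision and sensitivity are reported alongside it.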