Bharti Deepak R, Lynn Andrew M
School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-67.
Bioinformation. 2017 May 31;13(5):154-159. doi: 10.6026/97320630013154. eCollection 2017.
Malaria is a major infectious disease with a global footprint, and it is especially severe in developing countries in Africa. In recent years, drug-resistant malaria has become an alarming problem, making the need for new and improved drugs more pressing than ever. The apicoplast is a promising location for antimalarial drug targets because this organelle does not occur in humans. It is associated with many unique and essential pathways in Apicomplexan pathogens, including Plasmodium. Machine learning methods are now widely available through open-source programs. In the present work, we describe a standard protocol for developing predictive models based on molecular descriptors (QSAR models), which can then be used to screen large chemical libraries. The protocol is used to build models from training data sourced from apicoplast-specific bioassays. Multiple model-building methods are used, including Generalized Linear Models (GLM), Random Forest (RF), the C5.0 implementation of a decision tree, Support Vector Machines (SVM), K-Nearest Neighbour and Naive Bayes. The protocol also includes methods for evaluating the accuracy of each model-building method. For the given dataset, C5.0, SVM and RF perform better than the other methods, with comparable accuracy on the test data.
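The general shape of such a protocol (split descriptor data into training and test sets, fit a classifier, and score it on the held-out set) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic stand-ins for molecular descriptors, only K-Nearest Neighbour is shown out of the methods listed, and every function name and parameter below is a hypothetical choice made for illustration.

```python
import math
import random

# Synthetic stand-in for a descriptor table: each row is (features, label).
# Real work would use descriptors computed from bioassay compounds instead.
def make_synthetic_descriptors(n_per_class, n_features, seed=0):
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, 0.0), (1, 2.0)):  # inactive vs. active (assumed)
        for _ in range(n_per_class):
            features = [rng.gauss(centre, 1.0) for _ in range(n_features)]
            data.append((features, label))
    rng.shuffle(data)
    return data

# Hold out the first test_fraction of the (already shuffled) rows for testing.
def train_test_split(data, test_fraction=0.25):
    n_test = int(len(data) * test_fraction)
    return data[n_test:], data[:n_test]

# K-Nearest Neighbour: majority vote among the k closest training rows.
def knn_predict(train, query, k=5):
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

# Trivial reference model: always predict the most common training label.
def majority_baseline(train):
    ones = sum(label for _, label in train)
    return 1 if ones * 2 >= len(train) else 0

# Fraction of predictions that match the true labels.
def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

data = make_synthetic_descriptors(100, 5, seed=42)
train, test = train_test_split(data)
truth = [label for _, label in test]

knn_acc = accuracy([knn_predict(train, f) for f, _ in test], truth)
base_label = majority_baseline(train)
base_acc = accuracy([base_label] * len(test), truth)

print(f"kNN accuracy: {knn_acc:.2f}")
print(f"majority-class baseline accuracy: {base_acc:.2f}")
```

Comparing each model against a trivial baseline on the same held-out set is one simple way to judge whether a model-building method has learned anything useful; the same loop could in principle be repeated for each of the other methods named above.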