Gaba Sonam, Jamal Salma, Scaria Vinod
GN Ramachandran Knowledge Center for Genome Informatics, CSIR Institute of Genomics and Integrative Biology, Mall Road, Delhi 110007, India.
CSIR Open Source Drug Discovery Unit, Anusandhan Bhawan, 2 Rafi Marg, Delhi 110001, India.
ScientificWorldJournal. 2014;2014:957107. doi: 10.1155/2014/957107. Epub 2014 Nov 25.
Schistosomiasis is a neglected tropical disease caused by the parasite Schistosoma mansoni that affects over 200 million people annually. With the recent emergence of drug resistance, there is an urgent need to discover novel therapeutic options to control the disease. The multifunctional protein thioredoxin glutathione reductase (TGR), an enzyme essential for the survival of the pathogen in the redox environment, has been actively explored as a potential drug target. The recent availability of small-molecule screening datasets against this target provides a unique opportunity to learn molecular properties and apply computational models to discover active molecules in large molecular libraries. Such a prioritisation approach could reduce the cost of failures in lead discovery. A supervised learning approach was employed to develop a cost-sensitive classification model to evaluate the biological activity of the molecules. Random forest was identified as the best of the classifiers evaluated, with an accuracy of around 80 percent. An independent maximally occurring substructure analysis revealed 10 highly enriched scaffolds in the actives dataset, and their docking against the target was also performed. We show that combining machine learning with other cheminformatics approaches, such as substructure comparison and molecular docking, is an efficient way to prioritise molecules from large molecular datasets.
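The idea behind a cost-sensitive classifier can be illustrated with a minimal sketch. The abstract does not say how cost sensitivity was implemented, so everything here is an assumption made for illustration: the cost matrix values, the function names `min_cost_label` and `evaluate`, and the use of a predicted probability of activity (as a random forest or any other probabilistic classifier might supply). The sketch picks, for each molecule, the label with the lowest expected misclassification cost rather than the most probable label.

```python
from typing import List, Sequence, Tuple

# Hypothetical cost matrix, COST[true_label][predicted_label],
# with 0 = inactive and 1 = active. Missing an active molecule
# (false negative) is assumed to cost five times a false positive.
COST: List[List[float]] = [
    [0.0, 1.0],  # true inactive: correct, false positive
    [5.0, 0.0],  # true active:   false negative, correct
]


def min_cost_label(p_active: float, cost: Sequence[Sequence[float]] = COST) -> int:
    """Return the label (0 or 1) with the lowest expected cost.

    p_active is the classifier's estimated probability that the
    molecule is active.
    """
    p = (1.0 - p_active, p_active)
    expected = [
        sum(p[true] * cost[true][pred] for true in (0, 1))
        for pred in (0, 1)
    ]
    return 0 if expected[0] <= expected[1] else 1


def evaluate(
    y_true: Sequence[int],
    p_active: Sequence[float],
    cost: Sequence[Sequence[float]] = COST,
) -> Tuple[float, float]:
    """Return (accuracy, total misclassification cost) for a labelled set."""
    preds = [min_cost_label(p, cost) for p in p_active]
    correct = sum(1 for t, y in zip(y_true, preds) if t == y)
    total_cost = sum(cost[t][y] for t, y in zip(y_true, preds))
    return correct / len(y_true), total_cost


if __name__ == "__main__":
    # Toy labels and probabilities, not data from the study.
    labels = [1, 0, 0, 1]
    probabilities = [0.2, 0.1, 0.3, 0.05]
    print(evaluate(labels, probabilities))
```

With this assumed matrix a molecule is called active whenever its estimated probability exceeds 1/6, which trades some accuracy for fewer missed actives. That trade-off is typically why cost-sensitive models are used on screening data where actives are rare.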