Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, Maharashtra, India.
Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India.
PLoS One. 2020 Nov 30;15(11):e0242943. doi: 10.1371/journal.pone.0242943. eCollection 2020.
Essential gene prediction helps to identify the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for predicting gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective of this work is to develop a new ML pipeline that helps annotate essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm, the Laplacian Support Vector Machine (LapSVM), to predict essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to the area under the ROC curve (auROC), is proposed for selecting the best model when supervised performance metrics are difficult to calculate due to lack of data. Unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle, classifying and predicting essential genes with high accuracy (auROC > 0.85) even when only 1% of the data was labeled for model training. The pipeline was validated on both Eukaryotes and Prokaryotes, where it showed high accuracy even when the labeled dataset was very limited, and was then used to predict essential genes of organisms with inadequate experimental data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach is thus proposed for essential gene prediction that is applicable to both Prokaryotes and Eukaryotes with limited labeled data. The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and the identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
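The idea of graph-based semi-supervised classification from very few labels can be illustrated with a minimal sketch. It does not implement LapSVM, the Kamada-Kawai layout, or the Semi-Supervised Model Selection Score; instead it swaps in plain label propagation on a k-nearest-neighbour graph as a simpler stand-in. The data are a hypothetical toy layout (two concentric rings, with the inner ring arbitrarily treated as "essential"), not the paper's data, and all function names are illustrative assumptions.

```python
import math

def ring(n, radius):
    """Return n evenly spaced 2-D points on a circle (toy data, an assumption)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def knn_graph(points, k):
    """Build symmetric k-nearest-neighbour adjacency sets."""
    adj = [set() for _ in points]
    for i, (xi, yi) in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def propagate(adj, seeds, iterations):
    """Label propagation: each unlabeled node repeatedly takes the mean
    score of its neighbours, while labeled (seed) nodes stay clamped."""
    scores = [seeds.get(i, 0.5) for i in range(len(adj))]
    for _ in range(iterations):
        scores = [seeds[i] if i in seeds
                  else sum(scores[j] for j in adj[i]) / len(adj[i])
                  for i in range(len(adj))]
    return scores

def auroc(scores, labels):
    """auROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 20 "essential" points (inner ring), 80 "non-essential" (outer ring).
points = ring(20, 1.0) + ring(80, 3.0)
labels = [1] * 20 + [0] * 80

# Only 2 of the 100 points carry a label: one seed per class.
seeds = {0: 1.0, 20: 0.0}

adj = knn_graph(points, k=2)
scores = propagate(adj, seeds, iterations=3000)

# Evaluate on the unlabeled points only.
unlabeled = [i for i in range(len(points)) if i not in seeds]
predicted = {i: int(scores[i] > 0.5) for i in unlabeled}
accuracy = sum(predicted[i] == labels[i] for i in unlabeled) / len(unlabeled)
toy_auroc = auroc([scores[i] for i in unlabeled], [labels[i] for i in unlabeled])
print(accuracy, toy_auroc)
```

Because the two rings are well separated, the neighbour graph never links them, so each seed label spreads only within its own ring. Real embeddings of metabolic-network features are unlikely to be this clean, which is where a regularized classifier such as LapSVM would be expected to matter.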