Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy.

Affiliations

Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, Maharashtra, India.

Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India.

Publication information

PLoS One. 2020 Nov 30;15(11):e0242943. doi: 10.1371/journal.pone.0242943. eCollection 2020.

DOI:10.1371/journal.pone.0242943

PMID:33253254

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7703937/

Abstract

Essential gene prediction helps to find the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for predicting gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is to develop a new ML pipeline to help annotate the essential genes of less explored disease-causing organisms for which minimal experimental data are available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm, the Laplacian Support Vector Machine (LapSVM), to predict essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to the area under the ROC curve (auROC), is proposed for selecting the best model when supervised performance metrics are difficult to calculate due to lack of data. Unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle, classifying and predicting essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation on both Eukaryotes and Prokaryotes, where the pipeline showed high accuracy even when the labeled dataset was very limited, the strategy was used to predict essential genes of organisms with inadequate experimental data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that is applicable to both Prokaryotes and Eukaryotes with limited labeled data. The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
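
To illustrate the general idea of graph-based semi-supervised classification from very few labels, here is a minimal sketch. It is not the authors' pipeline: it swaps in plain label propagation over a k-nearest-neighbour graph in place of LapSVM, uses a synthetic two-ring dataset instead of features from a metabolic network, and scores the result with ordinary auROC rather than the Semi-Supervised Model Selection Score. The helper names (`ring`, `knn_graph`, `propagate`, `auroc`) and all parameter values are assumptions made for illustration.

```python
import math

def ring(n, radius):
    """Return n evenly spaced 2-D points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def knn_graph(points, k):
    """For each point, list the indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.hypot(p[0] - q[0], p[1] - q[1]), j)
                       for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def propagate(neighbours, labels, n_iter=200):
    """Label propagation: each unlabeled score becomes the mean of its
    neighbours' scores; labeled nodes (index -> +1.0 or -1.0) stay clamped."""
    scores = [labels.get(i, 0.0) for i in range(len(neighbours))]
    for _ in range(n_iter):
        new = [sum(scores[j] for j in nbrs) / len(nbrs) for nbrs in neighbours]
        for i, y in labels.items():
            new[i] = y
        scores = new
    return scores

def auroc(scores, truth):
    """auROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (assumption): inner ring = class 1, outer ring = class 0.
points = ring(40, 1.0) + ring(80, 3.0)
truth = [1] * 40 + [0] * 80

# Only one labeled example per class, i.e. under 2% of the 120 points.
labels = {0: 1.0, 40: -1.0}

neighbours = knn_graph(points, k=4)
scores = propagate(neighbours, labels)

# Evaluate only on the points that were never labeled.
unlabeled = [i for i in range(len(points)) if i not in labels]
result = auroc([scores[i] for i in unlabeled], [truth[i] for i in unlabeled])
print(result)  # prints 1.0 on this toy data
```

The two rings are deliberately well separated, so every point's nearest neighbours lie on its own ring and a single label per ring is enough to recover both classes. Real data would not separate this cleanly, which is where a margin-based method such as LapSVM and careful model selection matter.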


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56e0/7703937/c431eecd9b33/pone.0242943.g001.jpg

Similar articles

Machine learning approach to gene essentiality prediction: a review.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab128.

Predicting essential genes of 41 prokaryotes by a semi-supervised method.

Anal Biochem. 2020 Nov 15;609:113919. doi: 10.1016/j.ab.2020.113919. Epub 2020 Aug 19.

An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features.

Mol Biosyst. 2017 Jul 25;13(8):1584-1596. doi: 10.1039/c7mb00234c.

A Machine Learning Approach for Predicting Essentiality of Metabolic Genes.

Methods Mol Biol. 2024;2760:345-369. doi: 10.1007/978-1-0716-3658-9_20.

EPGAT: Gene Essentiality Prediction With Graph Attention Networks.

IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1615-1626. doi: 10.1109/TCBB.2021.3054738. Epub 2022 Jun 3.

A novel candidate disease gene prioritization method using deep graph convolutional networks and semi-supervised learning.

BMC Bioinformatics. 2022 Oct 14;23(1):422. doi: 10.1186/s12859-022-04954-x.

A semi-supervised learning based method: Laplacian support vector machine used in diabetes disease diagnosis.

Interdiscip Sci. 2009 Jun;1(2):151-5. doi: 10.1007/s12539-009-0016-2. Epub 2009 May 28.

A semi-supervised machine learning framework for microRNA classification.

Hum Genomics. 2019 Oct 22;13(Suppl 1):43. doi: 10.1186/s40246-019-0221-7.

Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture.

Med Biol Eng Comput. 2023 Nov;61(11):2895-2919. doi: 10.1007/s11517-023-02892-1. Epub 2023 Aug 2.

Cited by

A hybrid machine learning model with attention mechanism and multidimensional multivariate feature coding for essential gene prediction.

BMC Biol. 2025 Apr 24;23(1):108. doi: 10.1186/s12915-025-02209-8.

Machine learning methods for predicting essential metabolic genes from Plasmodium falciparum genome-scale metabolic network.

PLoS One. 2024 Dec 23;19(12):e0315530. doi: 10.1371/journal.pone.0315530. eCollection 2024.

Untangling the Context-Specificity of Essential Genes by Means of Machine Learning: A Constructive Experience.

Biomolecules. 2023 Dec 22;14(1):18. doi: 10.3390/biom14010018.

Genome engineering on size reduction and complexity simplification: A review.

J Adv Res. 2024 Jun;60:159-171. doi: 10.1016/j.jare.2023.07.006. Epub 2023 Jul 12.

Integration of text mining and biological network analysis: Identification of essential genes in sulfate-reducing bacteria.

Front Microbiol. 2023 Apr 13;14:1086021. doi: 10.3389/fmicb.2023.1086021. eCollection 2023.

J Biosci. 2022;47(2). doi: 10.1007/s12038-022-00253-y.

References

DeeplyEssential: a deep neural network for predicting essential genes in microbes.

BMC Bioinformatics. 2020 Sep 30;21(Suppl 14):367. doi: 10.1186/s12859-020-03688-y.

Application of deep learning methods in biological networks.

Brief Bioinform. 2021 Mar 22;22(2):1902-1917. doi: 10.1093/bib/bbaa043.

Machine and deep learning meet genome-scale metabolic modeling.

PLoS Comput Biol. 2019 Jul 11;15(7):e1007084. doi: 10.1371/journal.pcbi.1007084. eCollection 2019 Jul.

Network-based methods for predicting essential genes or proteins: a survey.

Brief Bioinform. 2020 Mar 23;21(2):566-583. doi: 10.1093/bib/bbz017.

Computational methods for identifying the critical nodes in biological networks.

Brief Bioinform. 2020 Mar 23;21(2):486-497. doi: 10.1093/bib/bbz011.

Network-based features enable prediction of essential genes across diverse organisms.

PLoS One. 2018 Dec 13;13(12):e0208722. doi: 10.1371/journal.pone.0208722. eCollection 2018.

The Gene Ontology Resource: 20 years and still GOing strong.

Nucleic Acids Res. 2019 Jan 8;47(D1):D330-D338. doi: 10.1093/nar/gky1055.

UniProt: a worldwide hub of protein knowledge.

Nucleic Acids Res. 2019 Jan 8;47(D1):D506-D515. doi: 10.1093/nar/gky1049.

New approach for understanding genome variations in KEGG.

Nucleic Acids Res. 2019 Jan 8;47(D1):D590-D595. doi: 10.1093/nar/gky962.

Perspectives on Leishmania Species and Stage-specific Adaptive Mechanisms.

Trends Parasitol. 2018 Dec;34(12):1068-1081. doi: 10.1016/j.pt.2018.09.004. Epub 2018 Oct 11.
