Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria.
Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab128.
Essential genes are critical for the growth and survival of any organism. Machine learning approaches complement experimental methods and reduce the resources required for essentiality assays. Previous studies revealed the need to identify relevant features that significantly improve the classification of essential genes, to improve the generalizability of prediction models across organisms, and to construct a robust gold standard as the class label for the training data to enhance prediction. Findings also show that a significant limitation of the machine learning approach is the prediction of conditionally essential genes, since the essentiality status of a gene can change under a specific condition of the organism. This review examines the various methods applied to the essential gene prediction task, their strengths and limitations, and the factors responsible for effective computational prediction of essential genes. We discuss categories of features and how they contribute to the classification performance of essentiality prediction models. Five categories of features, namely gene sequence, protein sequence, network topology, homology and gene ontology-based features, were generated for Caenorhabditis elegans to perform a comparative analysis of their essentiality prediction capacity. The gene ontology-based feature category outperformed the other categories, mainly because of its high correlation with the genes' biological functions. However, the topology feature category provided the highest discriminatory power, making it more suitable for essentiality prediction. The major factor limiting machine learning prediction of conditional gene essentiality is the unavailability of labeled data, for the conditions of interest, on which a classifier can be trained. Therefore, cooperative machine learning could be further exploited to build models that perform well in conditional essentiality prediction.
Identifying essential genes is important because it provides an understanding of an organism's core structure and function and accelerates the discovery of drug targets, among other benefits. Recent studies have applied machine learning to complement the experimental identification of essential genes. However, several factors limit the performance of machine learning approaches. This review presents the standard procedure and the resources available for predicting essential genes in organisms, and highlights the factors responsible for the current limitations in using machine learning for conditional gene essentiality prediction. The choice of features and of machine learning technique was identified as an important factor in predicting essential genes effectively.
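As a rough illustration of what two of the feature categories named above might look like in practice, the sketch below computes one gene sequence feature (GC content) and one network topology feature (normalized degree in an interaction network) for a few genes. It is a minimal sketch under stated assumptions, not the procedure used in the review: the gene identifiers, sequences, interaction edges and function names are all hypothetical, and a real analysis would draw on many more features per category.

```python
from collections import defaultdict


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (a gene sequence feature)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def degree_centrality(edges):
    """Normalized degree of each node in an undirected interaction network
    (a network topology feature). Self-loops are ignored."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def build_feature_table(sequences, edges):
    """Combine both features into one record per gene.
    Genes absent from the network get a centrality of 0.0."""
    centrality = degree_centrality(edges)
    return {
        gene: {
            "gc_content": gc_content(seq),
            "degree_centrality": centrality.get(gene, 0.0),
        }
        for gene, seq in sequences.items()
    }


# Hypothetical toy data: identifiers, sequences and interactions are made up.
sequences = {"geneA": "ATGCGC", "geneB": "ATATAT", "geneC": "GGCC"}
edges = [("geneA", "geneB"), ("geneA", "geneC")]

features = build_feature_table(sequences, edges)
for gene, row in features.items():
    print(gene, row)
```

A table of this shape, with one row per gene and one column per feature, is the kind of input that a classifier could then be trained on against essentiality labels; which labels and which classifier to use are left open here.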