State Key Laboratory of Crop Biology, Shandong Agricultural University, Taian, 273100, China.
Quantitative Life Sciences Initiative, Center for Plant Science Innovation, and Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
Plant Genome. 2020 Jul;13(2):e20015. doi: 10.1002/tpg2.20015. Epub 2020 Apr 29.
Advances in genome sequencing and annotation have eased the difficulty of identifying new gene sequences. Predicting the functions of these newly identified genes remains challenging. Genes descended from a common ancestral sequence are likely to have common functions. As a result, homology is widely used for gene function prediction. This means functional annotation errors also propagate from one species to another. Several approaches based on machine learning classification algorithms were evaluated for their ability to accurately predict gene function from non-homology gene features. Among the eight supervised classification algorithms evaluated, random-forest-based prediction consistently provided the most accurate gene function prediction. Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms, the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest. GO prediction models trained with homology-based annotations were able to successfully predict annotations from a manually curated "gold standard" GO annotation set. Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes which lack functionally characterized homologs, and to identify and correct functional annotation errors which were propagated through homology-based functional annotations.
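The comparison of predicted annotations against a manually curated "gold standard" set can be illustrated with a short sketch. It treats each GO term as its own binary task over a set of genes, scores it with F1, and averages the scores within each GO domain. Everything here is an assumption made for illustration: the abstract does not state which metric was used, the gene identifiers and annotation sets are invented toy values that do not reflect the paper's results, and the helper names `f1_per_term` and `mean_by_domain` are hypothetical.

```python
from collections import defaultdict


def f1_per_term(predicted, gold, genes):
    """Return {GO term: F1}, scoring each term as a binary task over `genes`."""
    terms = set()
    for gene in genes:
        terms |= predicted.get(gene, set()) | gold.get(gene, set())

    scores = {}
    for term in sorted(terms):
        tp = fp = fn = 0
        for gene in genes:
            is_pred = term in predicted.get(gene, set())
            is_gold = term in gold.get(gene, set())
            if is_pred and is_gold:
                tp += 1
            elif is_pred:
                fp += 1
            elif is_gold:
                fn += 1
        denom = 2 * tp + fp + fn
        scores[term] = 2 * tp / denom if denom else 0.0
    return scores


def mean_by_domain(scores, term_domain):
    """Average per-term scores within each GO domain."""
    grouped = defaultdict(list)
    for term, score in scores.items():
        grouped[term_domain[term]].append(score)
    return {domain: sum(vals) / len(vals) for domain, vals in grouped.items()}


# Toy inputs (hypothetical gene IDs; annotation sets invented for illustration).
genes = ["geneA", "geneB", "geneC", "geneD"]
term_domain = {
    "GO:0006355": "Biological Process",   # regulation of DNA-templated transcription
    "GO:0003677": "Molecular Function",   # DNA binding
}
gold = {
    "geneA": {"GO:0006355", "GO:0003677"},
    "geneB": {"GO:0006355"},
    "geneC": {"GO:0003677"},
    "geneD": set(),
}
predicted = {
    "geneA": {"GO:0006355"},
    "geneB": {"GO:0006355"},
    "geneC": {"GO:0006355"},
    "geneD": {"GO:0003677"},
}

scores = f1_per_term(predicted, gold, genes)
domain_means = mean_by_domain(scores, term_domain)
for domain, mean_f1 in sorted(domain_means.items()):
    print(f"{domain}: mean F1 = {mean_f1:.2f}")
```

Scoring each term separately and then averaging by domain keeps frequently annotated terms from dominating the summary, which is one plausible way to compare performance across the Biological Process and Molecular Function domains.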