Lu Yuntao, Li Qi, Li Tao
Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China.
College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing, China.
Front Genet. 2022 Apr 4;13:839453. doi: 10.3389/fgene.2022.839453. eCollection 2022.
With the rapid development of sequencing technology, the number of completed microbial genomes has grown explosively. For a newly sequenced prokaryotic genome, gene functional annotation and metabolic pathway assignment are important foundations for all subsequent research. However, the overall assignment rate for gene metabolic pathways is below 48%, and it is even lower for newly sequenced prokaryotic genomes, which has become a bottleneck for subsequent research. A high-precision metabolic pathway assignment framework is therefore urgently needed. Here, we developed PPA-GCN, a prokaryotic pathways assignment framework based on a graph convolutional network, to assist functional pathway assignment using KEGG information and genomic characteristics. In the framework, gene synteny information from the genomes was used to construct a network, and ideas from self-supervised learning were adopted to enhance the framework's learning ability. Our framework is applicable to microbial genera with sufficient whole-genome sequences. To evaluate the assignment rate, genomes from three different genera (65, 100, and 500 genomes, respectively) were used. The initial functional pathway assignment rates of the three test genera were 27.7%, 49.5%, and 30.1%, respectively. PPA-GCN achieved excellent performance, raising the assignment rates to 84.8%, 77.0%, and 71.0%, respectively. PPA-GCN also proved to have strong fault tolerance. The framework provides novel insights into metabolic pathway assignment, is likely to inform future deep learning applications for interpreting functional annotations, and can be extended to all prokaryotic genera with sufficient genomes.
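The general idea of building a network from gene synteny and spreading pathway labels over it can be illustrated with a small sketch. This is not the PPA-GCN implementation: the function names, the toy gene families (`a` to `e`), and the pathway labels (`P1`, `P2`) are all hypothetical, and the sketch applies a single normalized-adjacency propagation step (the core operation of a graph convolution layer) without learned weights or any self-supervised training.

```python
from collections import defaultdict
from math import sqrt


def build_synteny_graph(genomes):
    """Edge weight = number of genomes in which two gene families are adjacent."""
    weights = defaultdict(float)
    for order in genomes:
        for a, b in zip(order, order[1:]):
            if a != b:
                weights[frozenset((a, b))] += 1.0
    return weights


def normalized_adjacency(nodes, weights):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list of lists."""
    idx = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    # Start from the identity matrix (self-loops), then add weighted edges.
    adj = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for pair, w in weights.items():
        u, v = tuple(pair)
        adj[idx[u]][idx[v]] = w
        adj[idx[v]][idx[u]] = w
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]


def assign_unlabelled(nodes, labels, pathways, a_hat):
    """One propagation step A_hat @ X over one-hot pathway features.

    Known labels are kept; an unlabelled node takes the pathway with the
    highest propagated score, or stays None if every score is zero.
    """
    feats = [[1.0 if labels[n] == p else 0.0 for p in pathways] for n in nodes]
    result = dict(labels)
    for i, node in enumerate(nodes):
        if labels[node] is not None:
            continue
        scores = [sum(a_hat[i][k] * feats[k][c] for k in range(len(nodes)))
                  for c in range(len(pathways))]
        best = max(range(len(pathways)), key=lambda c: scores[c])
        if scores[best] > 0.0:
            result[node] = pathways[best]
    return result


def assignment_rate(labels):
    """Fraction of gene families that carry a pathway label."""
    return sum(v is not None for v in labels.values()) / len(labels)


# Toy data (hypothetical): three genomes as ordered lists of gene families.
genomes = [["a", "b", "c", "d"], ["a", "b", "c", "d"], ["b", "c", "e", "d"]]
nodes = ["a", "b", "c", "d", "e"]
pathways = ["P1", "P2"]
known = {"a": "P1", "b": "P1", "c": None, "d": "P2", "e": None}

a_hat = normalized_adjacency(nodes, build_synteny_graph(genomes))
predicted = assign_unlabelled(nodes, known, pathways, a_hat)
rate_before = assignment_rate(known)
rate_after = assignment_rate(predicted)
```

In this toy example the assignment rate rises from 0.6 to 1.0 after one propagation step. A trained graph convolutional network would additionally learn weight matrices and stack several such layers, so the sketch only conveys why conserved gene neighbourhoods carry pathway information.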