School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China.
Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computing Intelligence, Jiangnan University, Wuxi, China.
BMC Bioinformatics. 2024 Apr 11;25(1):146. doi: 10.1186/s12859-024-05766-x.
The advent of high-throughput technologies has led to an exponential increase in uncharacterized bacterial protein sequences, surpassing the capacity of manual curation. A large number of bacterial protein sequences remain unannotated by Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology, making it necessary to use automated annotation tools. These tools are now indispensable in biological research, bridging the gap between the vast number of unannotated sequences and meaningful biological insights.
In this work, we propose a novel pipeline for KEGG orthology annotation of bacterial protein sequences that uses natural language processing and deep learning. To assess its effectiveness, we evaluated the pipeline on the genomes of two randomly selected species from the KEGG database. It achieved competitive precision, recall, and F1 score of 0.948, 0.947, and 0.947, respectively.
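For readers less familiar with these metrics, the following is a minimal sketch of how precision, recall, and F1 could be computed for a multi-class KEGG orthology (KO) assignment task. The abstract does not state which averaging scheme was used, so the macro-averaging over true labels shown here is an assumption, and the function names `per_label_counts` and `macro_scores` are hypothetical illustrations rather than part of the published pipeline.

```python
from collections import Counter


def per_label_counts(true_labels, predicted_labels):
    """Count true positives, false positives, and false negatives per KO label.

    A prediction of None means the sequence was left unannotated.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            # A wrong (non-empty) prediction is a false positive for the
            # predicted label and a false negative for the true label.
            if pred is not None:
                fp[pred] += 1
            fn[truth] += 1
    return tp, fp, fn


def macro_scores(true_labels, predicted_labels):
    """Return macro-averaged (precision, recall, F1).

    Assumption: the average is taken over labels that occur in the
    reference annotation; other averaging choices are equally plausible.
    """
    tp, fp, fn = per_label_counts(true_labels, predicted_labels)
    labels = sorted(set(true_labels))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        p = tp[label] / predicted_total if predicted_total else 0.0
        r = tp[label] / actual_total if actual_total else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Calling `macro_scores(reference_kos, predicted_kos)` with two equal-length lists of KO identifiers returns the three scores as a tuple. With a large and imbalanced label set such as KO, the choice between macro, micro, and weighted averaging can change the reported values noticeably.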
Our experimental results suggest that the pipeline performs comparably to traditional methods and excels at identifying distant relatives with low sequence identity. This indicates its potential to significantly improve the accuracy and comprehensiveness of KEGG orthology annotation, thereby advancing our understanding of functional relationships within biological systems.