Xiao Yao, Zhang Yan
Shenzhen Key Laboratory of Marine Bioresources and Ecology, Brain Disease and Big Data Research Institute, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, Guangdong, China.
Shenzhen-Hong Kong Institute of Brain Science-Shenzhen Fundamental Research Institutions, Shenzhen, Guangdong, China.
mSystems. 2025 Apr 22;10(4):e0125824. doi: 10.1128/msystems.01258-24. Epub 2025 Mar 10.
Selenoproteins are a special group of proteins with major roles in cellular antioxidant defense. Their active sites contain the 21st amino acid, selenocysteine (Sec), which is encoded by an in-frame UGA codon. Compared to eukaryotes, identification of selenoprotein genes in bacteria remains challenging because there is no effective strategy for distinguishing a Sec-encoding UGA codon from a normal stop signal. In this study, we developed a deep learning-based algorithm, deep-Sep, for quickly and precisely identifying selenoprotein genes in bacterial genomic sequences. The algorithm uses a Transformer-based neural network architecture to construct an optimal model for detecting Sec-encoding UGA codons, followed by a homology search-based strategy to remove additional false positives. During the training and testing stages, deep-Sep achieved an F1 score of 0.939 and an area under the receiver operating characteristic curve of 0.987. Furthermore, when applied to 20 bacterial genomes as independent test data sets, deep-Sep identified both known and new selenoprotein genes and significantly outperformed the existing state-of-the-art method. Our algorithm is a powerful tool for comprehensively characterizing selenoprotein genes in bacterial genomes; it should not only assist in accurate annotation of selenoprotein genes in genome sequencing projects but also provide new insights into the roles of selenium in bacteria.
IMPORTANCE: Selenium is an essential micronutrient present in selenoproteins in the form of Sec, a rare amino acid encoded by the opal stop codon UGA. Identification of all selenoproteins is of vital importance for investigating the functions of selenium in nature. Previous strategies for predicting selenoprotein genes mainly relied on the identification of a special cis-acting Sec insertion sequence (SECIS) element within mRNAs. However, due to the complexity and variability of SECIS elements, recognition of all selenoprotein genes in bacteria is still a major challenge in the annotation of bacterial genomes. We have developed a deep learning-based algorithm to predict selenoprotein genes in bacterial genomic sequences, which demonstrates superior performance compared to currently available methods. This algorithm can be used in either web-based or local (standalone) mode, serving as a promising tool for identifying the complete set of selenoprotein genes in bacteria.
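To make the underlying prediction problem concrete, the sketch below enumerates UGA codons (written TGA in DNA) in the three forward reading frames of a sequence and extracts a window of flanking nucleotides around each one. It is not deep-Sep's code: the function name `candidate_tga_windows`, the `flank` parameter, and the window size are assumptions made for illustration, and the abstract does not describe how deep-Sep prepares its inputs. The sketch only shows the kind of candidate site that a classifier would have to label as Sec-encoding or as a true stop.

```python
from typing import List, Tuple


def candidate_tga_windows(dna: str, flank: int = 30) -> List[Tuple[int, int, str]]:
    """Return (position, frame, window) for every TGA codon on the forward strand.

    Hypothetical helper for illustration only; the flank size is an
    arbitrary assumption and the reverse strand is not scanned.
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        # Step through the sequence codon by codon in this reading frame.
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] == "TGA":
                # Clip the window at the sequence boundaries.
                start = max(0, pos - flank)
                end = min(len(dna), pos + 3 + flank)
                hits.append((pos, frame, dna[start:end]))
    return hits


if __name__ == "__main__":
    for pos, frame, window in candidate_tga_windows("ATGAAATGACCC", flank=2):
        print(pos, frame, window)
```

Each window returned this way is only a candidate; deciding which candidates encode Sec is the step the abstract assigns to the Transformer-based model and the subsequent homology search.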