Automated Prediction and Annotation of Small Open Reading Frames in Microbial Genomes.

Affiliations

Department of Genetics, Stanford University, Stanford, CA 94305, USA; Department of Medicine (Hematology, Blood and Marrow Transplantation), Stanford University, Stanford, CA 94305, USA.

Publication information

Cell Host Microbe. 2021 Jan 13;29(1):121-131.e4. doi: 10.1016/j.chom.2020.11.002. Epub 2020 Dec 7.

DOI:10.1016/j.chom.2020.11.002

PMID:33290720

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7856248/

Abstract

Small open reading frames (smORFs) and their encoded microproteins play central roles in microbes. However, there is a vast unexplored space of smORFs within human-associated microbes. A recent bioinformatic analysis used evolutionary conservation signals to enhance prediction of small protein families. To facilitate the annotation of specific smORFs, we introduce SmORFinder. This tool combines profile hidden Markov models of each smORF family and deep learning models that better generalize to smORF families not seen in the training set, resulting in predictions enriched for Ribo-seq translation signals. Feature importance analysis reveals that the deep learning models learn to identify Shine-Dalgarno sequences, deprioritize the wobble position in each codon, and group codon synonyms found in the codon table. A core-genome analysis of 26 bacterial species identifies several core smORFs of unknown function. We pre-compute smORF annotations for thousands of RefSeq isolate genomes and Human Microbiome Project metagenomes and provide these data through a public web portal.
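For readers unfamiliar with the terms, the following minimal Python sketch illustrates what a candidate smORF and a Shine-Dalgarno-like signal look like at the sequence level. It is not SmORFinder's implementation or API: the tool described above scores candidates with profile hidden Markov models and deep learning models, whereas this sketch only enumerates short start-to-stop frames on one strand. The function names, the 50-codon upper bound, the 5-codon lower bound, the 20-nt upstream window, and the 4-mer motif check are all assumptions chosen for illustration.

```python
# Illustrative sketch only; not SmORFinder's code or API.
START_CODONS = {"ATG", "GTG", "TTG"}   # common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
# 4-mers taken from the AGGAGG Shine-Dalgarno consensus (crude, assumed heuristic)
SD_CORES = ("AGGA", "GGAG", "GAGG")


def reverse_complement(seq):
    """Return the reverse complement so the other strand can be scanned too."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_sd_like_motif(seq, start, window=20):
    """True if a Shine-Dalgarno-like 4-mer occurs within `window` nt upstream of `start`."""
    upstream = seq[max(0, start - window):start].upper()
    return any(core in upstream for core in SD_CORES)


def find_candidate_smorfs(seq, min_codons=5, max_codons=50):
    """List (start, end, orf) for short ORFs on the given strand.

    Each start codon is extended to the first in-frame stop codon. The ORF is
    kept if its coding length (start codon included, stop codon excluded) is
    between min_codons and max_codons. Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in START_CODONS:
            continue
        # Furthest position at which a stop codon still gives <= max_codons.
        last_pos = min(len(seq) - 3, start + 3 * max_codons)
        for pos in range(start + 3, last_pos + 1, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    hits.append((start, pos + 3, seq[start:pos + 3]))
                break  # the first in-frame stop ends this ORF
    return hits


# Toy sequence: an SD-like motif, a short spacer, then a 6-codon ORF plus stop.
demo = "CCAGGAGGCCCCCC" + "ATG" + "AAA" * 5 + "TAA"
for start, end, orf in find_candidate_smorfs(demo):
    print(start, end, orf, has_sd_like_motif(demo, start))
```

To scan the opposite strand, the same function can be called on `reverse_complement(seq)`. A real annotation pipeline would additionally need to separate translated smORFs from the many spurious short frames, which is the problem the models in the abstract address.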

