Department of Genetics, Stanford University, Stanford, CA 94305, USA; Department of Medicine (Hematology, Blood and Marrow Transplantation), Stanford University, Stanford, CA 94305, USA.
Cell Host Microbe. 2021 Jan 13;29(1):121-131.e4. doi: 10.1016/j.chom.2020.11.002. Epub 2020 Dec 7.
Small open reading frames (smORFs) and their encoded microproteins play central roles in microbes. However, there is a vast unexplored space of smORFs within human-associated microbes. A recent bioinformatic analysis used evolutionary conservation signals to enhance prediction of small protein families. To facilitate the annotation of specific smORFs, we introduce SmORFinder. This tool combines profile hidden Markov models of each smORF family and deep learning models that better generalize to smORF families not seen in the training set, resulting in predictions enriched for Ribo-seq translation signals. Feature importance analysis reveals that the deep learning models learn to identify Shine-Dalgarno sequences, deprioritize the wobble position in each codon, and group codon synonyms found in the codon table. A core-genome analysis of 26 bacterial species identifies several core smORFs of unknown function. We pre-compute smORF annotations for thousands of RefSeq isolate genomes and Human Microbiome Project metagenomes and provide these data through a public web portal.
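To make the terms in the abstract concrete, the following is a minimal sketch of naive smORF candidate enumeration. It is not SmORFinder's method, which relies on profile hidden Markov models and deep learning models. The 50-codon upper bound, the start codon set, the Shine-Dalgarno core motifs, the 20-nt upstream window, and the function name `find_smorfs` are all illustrative assumptions not stated in the abstract. The sketch scans only the forward strand.

```python
import re

# Assumptions for illustration only (not taken from the abstract):
START_CODONS = {"ATG", "GTG", "TTG"}   # common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
MAX_CODONS = 50                        # assumed smORF size cutoff, stop excluded
SD_MOTIF = re.compile(r"AGGAGG|GGAGG|AGGAG|GAGG|AGGA|GGAG")  # assumed SD cores
SD_WINDOW = 20                         # assumed upstream window in nucleotides


def find_smorfs(dna, min_codons=5, max_codons=MAX_CODONS):
    """Return (start, end, has_sd) tuples for forward-strand candidate smORFs.

    start/end are 0-based, end-exclusive coordinates that include the stop
    codon; has_sd flags a Shine-Dalgarno-like motif upstream of the start.
    """
    dna = dna.upper()
    hits = []
    for start in range(len(dna) - 5):
        if dna[start:start + 3] not in START_CODONS:
            continue
        # Walk in-frame from the start codon to the first stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3  # start included, stop excluded
                if min_codons <= n_codons <= max_codons:
                    upstream = dna[max(0, start - SD_WINDOW):start]
                    hits.append((start, pos + 3, bool(SD_MOTIF.search(upstream))))
                break
    return hits


if __name__ == "__main__":
    toy = "TTAGGAGGTAACAT" + "ATG" + "AAA" * 6 + "TAA" + "CC"
    print(find_smorfs(toy))
```

A scan like this produces many false positives, because short ORFs arise frequently by chance. That is the motivation for the conservation-based and learned models described above.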