ShortStop: a machine learning framework for microprotein discovery.

Author information

Miller Brendan, de Souza Eduardo Vieira, Pai Victor J, Kim Hosung, Vaughan Joan M, Lau Calvin J, Diedrich Jolene K, Saghatelian Alan

Affiliations

Clayton Foundation Laboratories for Peptide Biology, The Salk Institute for Biological Studies, 10010 N Torrey Pines Rd, San Diego, CA USA.

USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA USA.

Publication information

BMC Methods. 2025;2(1):16. doi: 10.1186/s44330-025-00037-4. Epub 2025 Aug 1.

DOI:10.1186/s44330-025-00037-4

PMID:40756675

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12313729/

Abstract

BACKGROUND

The human genome contains over 3 million small open reading frames (smORFs, <150 codons). Ribosome profiling and proteogenomics have transformed our understanding of these sequences by showing that thousands are actively translated and that hundreds produce peptides detectable by mass spectrometry. However, the random arrangement of codons across the 3-gigabase human genome naturally generates smORFs by chance, suggesting that many may represent translational noise or regulatory elements rather than functional proteins. This is supported by the fact that most translating smORFs are upstream open reading frames (uORFs), which typically regulate translation of canonical coding sequences rather than encode bioactive microproteins. As interest grows in uncovering biologically meaningful microproteins, a key challenge remains: distinguishing functional smORFs from non-functional or regulatory translation products. Although empirical methods such as individual microprotein studies or large-scale screens can help, these approaches are time-consuming, expensive, and technically limited. New complementary strategies are needed.
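To make the smORF definition above concrete, the following is a minimal sketch of a forward-strand scan for ATG-initiated open reading frames below a codon cutoff. It is an illustration only, not the procedure used to arrive at the genome-wide count quoted above. The function name `find_smorfs`, the restriction to ATG starts on one strand, and the choice to exclude the stop codon from the codon count are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_smorfs(dna, max_codons=150, min_codons=2):
    """Return (start, end, n_codons) for ATG-initiated ORFs on the forward strand.

    Assumptions (not taken from the paper): only ATG starts are considered,
    the first in-frame ATG after the previous stop is used, and n_codons
    counts sense codons only (the stop codon is excluded).
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        start = None  # index of the currently open ATG in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons < max_codons:
                    # end is exclusive and includes the stop codon
                    hits.append((start, i + 3, n_codons))
                start = None
    return hits

if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # hypothetical toy sequence
    print(find_smorfs(toy))
```

Because any random sequence contains start and stop codons at some frequency, a scan like this returns many hits on unselected DNA, which is the reason that translation evidence alone does not establish function.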

METHODS

To address this challenge, we developed ShortStop, a computational framework based on the idea that not all translating smORFs produce functional proteins, but the ones that do may resemble experimentally characterized microproteins. ShortStop classifies smORFs into two reference groups: Swiss-Prot Analog Microproteins (SAMs), which resemble known microproteins, and PRISMs (Physicochemically Resembling In Silico Microproteins), which are synthetic sequences designed to match the composition of translating smORFs but lacking sequence order or evolutionary selection, and therefore serving as a proxy for non-functional peptides. This two-class system enables machine learning to help prioritize smORFs for downstream study.
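The idea of a negative reference set that keeps composition but discards sequence order can be sketched as follows. This is a simple residue-shuffling stand-in, not the authors' PRISM generation procedure, which the abstract does not specify. The function names, the `keep_start` option, and the example sequences are hypothetical.

```python
import random

def shuffled_decoy(protein, rng, keep_start=True):
    """Return a sequence with the same amino-acid composition in random order.

    keep_start=True leaves the first residue (typically Met) in place;
    this is an illustrative choice, not something stated in the source.
    """
    head = protein[:1] if keep_start else ""
    body = list(protein[1:] if keep_start else protein)
    rng.shuffle(body)  # in-place shuffle preserves composition exactly
    return head + "".join(body)

def make_decoys(proteins, seed=0):
    """Build one composition-matched decoy per input sequence, reproducibly."""
    rng = random.Random(seed)
    return [shuffled_decoy(p, rng) for p in proteins]

if __name__ == "__main__":
    examples = ["MKTLLVAGASRRW", "MSDEEKLPQ"]  # hypothetical peptide sequences
    for original, decoy in zip(examples, make_decoys(examples)):
        print(original, "->", decoy)
```

A classifier trained against decoys of this kind cannot rely on overall composition alone, so any separation it achieves must come from features that depend on residue order or position.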

RESULTS

ShortStop achieved high precision (90-94%), recall (87-96%), and F1 scores (90-93%) across all classes. When applied to a published dataset of translating smORFs, ShortStop classified about 8% as candidates with biochemical properties resembling Swiss-Prot microproteins (called SAMs). The remaining 92% resembled in silico generated sequences (called PRISMs), representing noncanonical proteins, non-functional peptides, or regulatory translation events. SAMs showed lower C-terminal hydrophobicity (linked to reduced proteasomal degradation) and greater N-terminal hydrophilicity at neutral pH, suggesting improved solubility and intracellular stability. ShortStop also identified microproteins overlooked by other methods, including one encoded by an upstream overlapping smORF in the StAR gene, which was detectable in human cells and steroid-producing tissues. In a clinical lung cancer dataset, ShortStop uncovered differentially expressed microprotein candidates, several of which were validated by mass spectrometry.
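For readers less familiar with the metrics reported above, here is a minimal sketch of how per-class precision, recall, and F1 are computed from true and predicted labels. The labels and predictions are toy values for illustration and are not data from the study; the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

if __name__ == "__main__":
    # Toy labels only; not results from the paper.
    y_true = ["SAM", "SAM", "PRISM", "PRISM", "PRISM"]
    y_pred = ["SAM", "PRISM", "PRISM", "PRISM", "SAM"]
    for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting the metrics per class matters here because the classes are imbalanced: with roughly 8% of smORFs called SAMs, overall accuracy alone could look high even if the minority class were poorly recovered.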

DISCUSSION

ShortStop addresses a key gap in microprotein research: the lack of scalable tools to characterize microproteins and of standardized negative training data for machine learning models. By providing a classification framework rooted in biochemical features, ShortStop offers a practical solution for targeting smORFs in functional studies, benchmarking new discovery tools, and advancing microprotein research.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s44330-025-00037-4.
