
ShortStop: a machine learning framework for microprotein discovery.

Authors

Miller Brendan, de Souza Eduardo Vieira, Pai Victor J, Kim Hosung, Vaughan Joan M, Lau Calvin J, Diedrich Jolene K, Saghatelian Alan

Affiliations

Clayton Foundation Laboratories for Peptide Biology, The Salk Institute for Biological Studies, 10010 N Torrey Pines Rd, San Diego, CA USA.

USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA USA.

Publication information

BMC Methods. 2025;2(1):16. doi: 10.1186/s44330-025-00037-4. Epub 2025 Aug 1.

DOI: 10.1186/s44330-025-00037-4
PMID: 40756675
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12313729/

Abstract

BACKGROUND

The human genome contains over 3 million small open reading frames (smORFs, <150 codons). Ribosome profiling and proteogenomics transformed our understanding of these sequences by showing that thousands are actively translated, and hundreds produce detectable peptides by mass spectrometry. However, the random arrangement of codons across the 3-gigabase human genome naturally generates smORFs by chance, suggesting many may represent translational noise or regulatory elements rather than functional proteins. This is supported by the fact that most translating smORFs occur in upstream open reading frames (uORFs), which typically regulate translation of canonical coding sequences rather than encode bioactive microproteins. As interest grows in uncovering biologically meaningful microproteins, a key challenge remains: distinguishing functional smORFs from non-functional or regulatory translation products. Although empirical methods such as individual microprotein studies or large-scale screens can help, these approaches are time-consuming, expensive, and come with technical limitations. New complementary strategies are needed.
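
The chance argument above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that codons are drawn independently and uniformly from the 64 possibilities, 3 of which are stop codons. Real genomes deviate from this model, so the numbers are only indicative. The function names are hypothetical and are not part of ShortStop.

```python
from fractions import Fraction

# Illustrative assumption: every codon is equally likely and independent.
TOTAL_CODONS = 64
STOP_CODONS = 3


def prob_no_stop(n_codons: int) -> Fraction:
    """Probability that n consecutive random codons contain no stop codon."""
    sense_fraction = Fraction(TOTAL_CODONS - STOP_CODONS, TOTAL_CODONS)
    return sense_fraction ** n_codons


def expected_codons_before_stop() -> Fraction:
    """Mean number of sense codons before the first stop codon.

    This is the mean of a geometric distribution counting failures
    before the first success: (1 - p) / p, with p = P(stop codon).
    """
    p_stop = Fraction(STOP_CODONS, TOTAL_CODONS)
    return (1 - p_stop) / p_stop


if __name__ == "__main__":
    print("Expected sense codons before a stop:", expected_codons_before_stop())
    print("P(no stop in 150 random codons):", float(prob_no_stop(150)))
```

Under this simplified model a random frame almost always hits a stop codon well before 150 codons, which is consistent with the point that short ORFs arise readily by chance.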

METHODS

To address this challenge, we developed ShortStop, a computational framework based on the idea that not all translating smORFs produce functional proteins, but the ones that do may resemble experimentally characterized microproteins. ShortStop classifies smORFs into two reference groups: Swiss-Prot Analog Microproteins (SAMs), which resemble known microproteins, and PRISMs (Physicochemically Resembling In Silico Microproteins), which are synthetic sequences designed to match the composition of translating smORFs but lacking sequence order or evolutionary selection, and therefore serving as a proxy for non-functional peptides. This two-class system enables machine learning to help prioritize smORFs for downstream study.
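
The idea of a negative class that matches composition but lacks sequence order can be illustrated with a simple residue shuffle. This is only a minimal sketch under that assumption: the abstract does not specify how PRISMs are generated, and the actual ShortStop procedure may differ. The function name and the example peptide are hypothetical.

```python
import random
from collections import Counter


def composition_matched_decoy(peptide: str, rng: random.Random) -> str:
    """Return a shuffled copy of `peptide`.

    The decoy has exactly the same amino acid composition and length as
    the input, but the residue order is randomized, so any order-dependent
    signal is destroyed.
    """
    residues = list(peptide)
    rng.shuffle(residues)  # in-place shuffle
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    peptide = "MKTLLVAGAVWSEERK"  # hypothetical sequence for illustration
    decoy = composition_matched_decoy(peptide, rng)
    print(peptide, "->", decoy)
    # Composition is preserved by construction.
    print(Counter(peptide) == Counter(decoy))
```

A shuffle keeps per-sequence composition exact. Sampling residues from a pooled background frequency table would be an alternative design, matching composition only on average.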

RESULTS

ShortStop achieved high precision (90-94%), recall (87-96%), and F1 scores (90-93%) across all classes. When applied to a published dataset of translating smORFs, ShortStop classified about 8% as candidates with biochemical properties resembling Swiss-Prot microproteins (called SAMs). The remaining 92% resembled in silico generated sequences (called PRISMs), representing noncanonical proteins, non-functional peptides, or regulatory translation events. SAMs showed lower C-terminal hydrophobicity (linked to reduced proteasomal degradation) and greater N-terminal hydrophilicity at neutral pH, suggesting improved solubility and intracellular stability. ShortStop also identified microproteins overlooked by other methods, including one encoded by an upstream overlapping smORF in the StAR gene, which was detectable in human cells and steroid-producing tissues. In a clinical lung cancer dataset, ShortStop uncovered differentially expressed microprotein candidates, several of which were validated by mass spectrometry.
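
For readers less familiar with the reported metrics, the sketch below shows how precision, recall, and F1 are computed for one class from confusion-matrix counts. The counts in the usage example are made up for illustration and are not taken from the paper. The function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts for a single class (not from the paper).
    tp, fp, fn = 9, 1, 3
    print("precision:", precision(tp, fp))
    print("recall:   ", recall(tp, fn))
    print("F1:       ", f1_score(tp, fp, fn))
```

In a multi-class setting such as the one described, these quantities are typically computed per class by treating that class as positive and all others as negative.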

DISCUSSION

ShortStop addresses a key gap in microprotein research: the lack of scalable tools to characterize microproteins and of standardized negative training data for machine learning models of microproteins. By providing a classification framework rooted in biochemical features, ShortStop offers a practical solution for targeting smORFs in functional studies, benchmarking new discovery tools, and advancing microprotein research.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1186/s44330-025-00037-4.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/715eb8c82414/44330_2025_37_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/74a9c6c21d35/44330_2025_37_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/0eff2e5e6c4d/44330_2025_37_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/010066cffb2c/44330_2025_37_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/190fe1e557b1/44330_2025_37_Fig5_HTML.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac96/12313729/61e83830855b/44330_2025_37_Fig6_HTML.jpg

