csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames.

Affiliations

Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China.

Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.

Publication information

Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac392.

DOI:10.1093/bib/bbac392

PMID:36094083

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9677467/

Abstract

Short open reading frames (sORFs) are small nucleic acid fragments, no longer than 303 nt, that may encode small peptides. To date, translatable sORFs have been found in both the untranslated regions of messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs), where they play vital roles in a myriad of biological processes. Because not all sORFs are translated or even translatable, a highly accurate computational tool for characterizing the coding potential of sORFs is needed to facilitate the discovery of novel functional peptides. To this end, we designed a series of ensemble models that integrate Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster, respectively. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed in-frame versions of all sequence-based features, namely i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these in-frame features significantly boost performance compared with the original 3-mer, CKSNAP and TDE features. Our performance comparisons showed that csORF-finder outperformed state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen lncRNA datasets for potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of csORF-encoded peptides.
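The abstract names the in-frame features but does not define how they are computed. The sketch below illustrates one plausible reading, and everything in it is an assumption rather than the authors' implementation: a plain 3-mer feature slides a window one base at a time, while the in-frame variant steps three bases at a time from the first base so that only codon-aligned windows are counted. CKSNAP is taken in its conventional sense (frequencies of nucleotide pairs separated by a fixed gap). The function names `kmer_frequencies` and `cksnap_frequencies` are hypothetical, and TDE is omitted because the abstract gives no formula for it.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def _normalize(counts, keys):
    """Turn raw counts into frequencies over a fixed, ordered key set."""
    total = sum(counts[k] for k in keys)
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def kmer_frequencies(seq, k=3, in_frame=False):
    """k-mer frequencies over all 4**k possible k-mers.

    in_frame=False: window slides by 1 base (plain k-mer feature).
    in_frame=True:  window slides by 3 bases from position 0, so only
                    codon-aligned windows count (assumed in-frame variant).
    Windows containing non-ACGT symbols are ignored.
    """
    seq = seq.upper()
    step = 3 if in_frame else 1
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    keys = ["".join(p) for p in product(BASES, repeat=k)]
    return _normalize(counts, keys)


def cksnap_frequencies(seq, gap=0, in_frame=False):
    """Frequencies of the 16 nucleotide pairs separated by `gap` bases.

    A full CKSNAP vector would concatenate the results for several gap values.
    """
    seq = seq.upper()
    step = 3 if in_frame else 1
    counts = Counter(
        seq[i] + seq[i + gap + 1]
        for i in range(0, len(seq) - gap - 1, step)
    )
    keys = ["".join(p) for p in product(BASES, repeat=2)]
    return _normalize(counts, keys)


if __name__ == "__main__":
    sorf = "ATGAAATGA"  # toy sequence: start codon, one codon, stop codon
    plain = kmer_frequencies(sorf)
    framed = kmer_frequencies(sorf, in_frame=True)
    print("plain ATG:", plain["ATG"], "in-frame ATG:", framed["ATG"])
```

In this toy sequence the plain feature also counts out-of-frame windows such as GAA, whereas the in-frame feature sees only the three codons. That difference is the kind of reading-frame signal the in-frame features are meant to capture. Vectors like these could then be fed to a classifier such as LightGBM, although the abstract does not describe how the ensemble combines its inputs.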
