csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames.

Author information

Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China.

Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.

Publication information

Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac392.

Abstract

Short open reading frames (sORFs) are short nucleic acid fragments, no longer than 303 nt, that may encode small peptides. To date, translatable sORFs have been found in both untranslated regions of messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs), where they play vital roles in a myriad of biological processes. Because not all sORFs are translated or even translatable, a highly accurate computational tool for characterizing the coding potential of sORFs is needed to facilitate the discovery of novel functional peptides. To this end, we designed a series of ensemble models that integrate Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed all types of in-frame sequence-based features, such as i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these features significantly boosted performance compared with the original 3-mer, CKSNAP and TDE features. Performance comparisons showed that csORF-finder outperformed state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen lncRNA datasets for potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of csORF-encoded peptides.
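
The abstract contrasts the original 3-mer feature with an in-frame variant (i-framed-3mer) but does not spell out how either is computed. The sketch below shows one plausible reading, and is not the authors' implementation: the plain 3-mer feature counts every overlapping trinucleotide, while the in-frame version counts only trinucleotides that start at codon positions (0, 3, 6, ...). The function name `kmer3_frequencies`, the frequency normalization, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import List

# All 64 possible trinucleotides, in a fixed order so vectors are comparable.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def kmer3_frequencies(seq: str, in_frame: bool) -> List[float]:
    """Return 64 trinucleotide frequencies for one sequence.

    in_frame=False slides the window one nucleotide at a time (plain 3-mer).
    in_frame=True steps three nucleotides at a time from position 0, so only
    codons in the reading frame are counted. This is an assumed reading of
    "i-framed-3mer", not a definition taken from the paper.
    """
    seq = seq.upper()
    step = 3 if in_frame else 1
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))
    # Windows containing characters other than A, C, G, T are ignored.
    total = sum(counts[t] for t in TRINUCLEOTIDES)
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


toy = "ATGGCCATGTAA"  # hypothetical toy sequence, not from the paper's data
plain = dict(zip(TRINUCLEOTIDES, kmer3_frequencies(toy, in_frame=False)))
framed = dict(zip(TRINUCLEOTIDES, kmer3_frequencies(toy, in_frame=True)))
```

On the toy sequence, out-of-frame trinucleotides such as TGG contribute to the plain vector but not to the in-frame one, which is the kind of codon-level signal an in-frame encoding would isolate. The CKSNAP and TDE features are not sketched here because the abstract does not define them.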

Similar articles

In-depth characterization and identification of translatable lncRNAs.
Comput Biol Med. 2023 Sep;164:107243. doi: 10.1016/j.compbiomed.2023.107243. Epub 2023 Jul 8.

Mining for missed sORF-encoded peptides.
Expert Rev Proteomics. 2019 Mar;16(3):257-266. doi: 10.1080/14789450.2019.1571919. Epub 2019 Feb 13.

Cited by

No country for old methods: New tools for studying microproteins.
iScience. 2024 Jan 20;27(2):108972. doi: 10.1016/j.isci.2024.108972. eCollection 2024 Feb 16.
