Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China.
Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac392.
Short open reading frames (sORFs) are small nucleic acid fragments, no longer than 303 nt, that may encode small peptides. To date, translatable sORFs have been found in both the untranslated regions of messenger RNAs (mRNAs) and in long non-coding RNAs (lncRNAs), where they play vital roles in a myriad of biological processes. Because not all sORFs are translated or even translatable, a highly accurate computational tool for characterizing the coding potential of sORFs is needed to facilitate the discovery of novel functional peptides. To this end, we designed a series of ensemble models that integrate Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster, respectively. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed in-frame versions of all sequence-based features, namely i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these in-frame features significantly improved performance compared with the original 3-mer, CKSNAP and TDE features. Our performance comparisons showed that csORF-finder outperformed state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen lncRNA datasets for potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of the peptides they encode.
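As a rough illustration of the difference between an ordinary 3-mer feature and an in-frame one, the sketch below computes both for a single sequence. It assumes that "in-frame" means counting only trinucleotides that start at codon positions of the sORF (every third nucleotide from a chosen frame offset); the abstract does not spell out the exact definition, so this may differ from the i-framed-3mer encoding used in csORF-finder. The function names `overlapping_3mer` and `in_frame_3mer` are hypothetical and are not taken from the tool.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotides in a fixed order, so every feature vector is aligned.
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]


def _frequencies(seq, step, offset):
    """Frequency of each 3-mer sampled every `step` nt, starting at `offset`."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(offset, len(seq) - 2, step))
    # Only count 3-mers made of A/C/G/T; anything else (e.g. N) is ignored.
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def overlapping_3mer(seq):
    """Conventional 3-mer composition: a sliding window with stride 1."""
    return _frequencies(seq, step=1, offset=0)


def in_frame_3mer(seq, frame=0):
    """Assumed in-frame variant: only codon-aligned 3-mers (stride 3)."""
    return _frequencies(seq, step=3, offset=frame)


if __name__ == "__main__":
    sorf = "ATGAAATGA"  # toy sORF: start codon, one sense codon, stop codon
    atg = KMERS.index("ATG")
    print("overlapping ATG frequency:", overlapping_3mer(sorf)[atg])
    print("in-frame ATG frequency:   ", in_frame_3mer(sorf)[atg])
```

In the toy sequence, the sliding window counts `ATG` twice (once out of frame), while the in-frame version counts it only at the start codon. This shows how codon-aligned counting can carry a different signal than overlapping counts.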