SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets.

Author information

Feng Hongqi, Nie Qi, Yang Sen

Affiliations

School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, China.

The Affiliated Changzhou No.2 People's Hospital of Nanjing Medical University, Changzhou, China.

Publication information

PLoS One. 2025 Apr 28;20(4):e0320314. doi: 10.1371/journal.pone.0320314. eCollection 2025.

DOI:10.1371/journal.pone.0320314

PMID:40294059

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12036913/

Abstract

BACKGROUND

Genome sequencing has made it possible to identify functional peptides encoded by short open reading frames (sORFs) in long non-coding RNAs (lncRNAs). Unlike common peptides, sORF-encoded peptides (SEPs) play significant roles in regulating gene expression, signaling, and other processes. Various computational methods have been proposed, but they lack contributive features and effective models. A high-throughput computational method for predicting SEPs is therefore needed.

RESULTS

We propose SORFPP, a computational method that predicts SEPs by mining feature information from multiple perspectives in an experimentally validated dataset from TranLnc. SORFPP extracts SEP sequence information using the protein language model ESM-2 together with curated traditional encodings, including QSOrder and k-mer, among others. It uses CatBoost to handle the sparsity of the traditional encodings and a self-attention model to analyze the ESM-2 pre-trained representations. Finally, an ensemble learning framework combines the two models, feeding their outputs into a logistic regression model for accurate and robust predictions. On three benchmark datasets, SORFPP outperforms other state-of-the-art models in Matthews correlation coefficient by 12.2%-24.2%.
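
As a rough illustration of two ingredients named above, the sketch below shows one common way to compute a k-mer frequency encoding for a peptide and to evaluate the Matthews correlation coefficient from confusion-matrix counts. It is not the authors' implementation: the function names `kmer_frequencies` and `mcc`, the choice of k = 2, and the decision to ignore windows containing non-standard residues are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt

# The 20 standard amino acids (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=2):
    """Return a fixed-length vector of normalized k-mer frequencies.

    The vector has 20**k entries, ordered like itertools.product over
    AMINO_ACIDS. Windows containing non-standard residues are ignored.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        # Sequence shorter than k, or no valid window: all-zero vector.
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional value when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    features = kmer_frequencies("MKKA", k=2)
    print(len(features))
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Most of the 400 dipeptide entries are zero for a short peptide, which illustrates the sparsity that the abstract attributes to traditional encodings.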

CONCLUSION

Integrating the ensemble learning strategy with contributive traditional features and protein language model encodings yields better performance. Datasets and code are accessible at https://doi.org/10.6084/m9.figshare.28079897 and http://111.229.198.94:5000/.
