Li Haibin, Meng Jun, Wang Zhaowei, Luan Yushi
School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China.
School of Bioengineering, Dalian University of Technology, Dalian, 116024, China.
Interdiscip Sci. 2025 Mar;17(1):114-133. doi: 10.1007/s12539-024-00661-8. Epub 2024 Oct 14.
Primary microRNAs (pri-miRNAs) have been observed to contain translatable small open reading frames (sORFs) that can encode peptides as independent elements. Studies have shown that these sORFs play a significant role in regulating the expression of biological traits. Existing methods for predicting the coding potential of sORFs frequently overlook these data or categorize them as negative samples, which impedes the identification of additional translatable sORFs in pri-miRNAs. To address this, we propose a novel method named misORFPred. Specifically, an enhanced scalable k-mer (ESKmer), which simultaneously integrates composition information within a sequence and distance information between sequences, is designed to extract nucleotide sequence features. After feature selection, the optimal features are combined with several machine learning classifiers to construct an ensemble model, in which a newly devised dynamic ensemble voting strategy (DEVS) dynamically adjusts the weights of the base classifiers and adaptively selects the optimal base classifiers for each unlabeled sample. Cross-validation results suggest that ESKmer and DEVS are essential for this classification task and improve model performance. Independent testing results indicate that misORFPred outperforms state-of-the-art methods. Furthermore, we run misORFPred on the genomes of various plant species and perform a thorough analysis of the predicted outcomes. Taken together, misORFPred is a powerful tool for identifying translatable sORFs in plant pri-miRNAs and can provide high-confidence candidates for subsequent biological experiments.
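The abstract names two ingredients, k-mer sequence features and a per-sample weighted vote over base classifiers, without giving their definitions. The sketch below is only a generic illustration of those two ideas and is not the authors' ESKmer or DEVS: `kmer_frequencies` computes plain normalized k-mer frequencies (no distance information), and `weighted_soft_vote` is a hypothetical weighted average of positive-class probabilities that optionally keeps only the classifiers most confident about a given sample. All function names, parameters, and the confidence heuristic are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Plain k-mer composition only; this is NOT the ESKmer descriptor.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count k-mers made of alphabet symbols (skips e.g. 'N').
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def weighted_soft_vote(probs, weights, top_n=None):
    """Combine positive-class probabilities from base classifiers for ONE sample.

    probs   : probability of the positive class from each base classifier
    weights : non-negative weight of each base classifier
    top_n   : if set, keep only the top_n classifiers whose probability is
              farthest from 0.5 for this sample (an assumed heuristic,
              not the DEVS selection rule)
    """
    idx = list(range(len(probs)))
    if top_n is not None:
        idx = sorted(idx, key=lambda i: abs(probs[i] - 0.5), reverse=True)[:top_n]
    weight_sum = sum(weights[i] for i in idx)
    if weight_sum == 0:
        raise ValueError("selected classifiers have zero total weight")
    return sum(weights[i] * probs[i] for i in idx) / weight_sum


if __name__ == "__main__":
    features = kmer_frequencies("ATGGCTTGA", k=3)
    print(len(features))  # 4**3 feature dimensions
    score = weighted_soft_vote([0.9, 0.45, 0.7], [1.0, 1.0, 2.0], top_n=2)
    print("positive" if score >= 0.5 else "negative")
```

In a real pipeline the feature vectors would feed trained base classifiers, and the weights would come from validation performance; both steps are outside what the abstract specifies.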