Zhao Siyuan, Meng Jun, Kang Qiang, Luan Yushi
IEEE/ACM Trans Comput Biol Bioinform. 2022 Sep-Oct;19(5):2873-2881. doi: 10.1109/TCBB.2021.3104288. Epub 2022 Oct 10.
Long non-coding RNAs (lncRNAs) contain short open reading frames (sORFs), and sORF-encoded short peptides (SEPs) have become a focus of scientific study because of their crucial role in life activities. Identifying SEPs is vital to further understanding their regulatory function. Bioinformatics methods can quickly identify SEPs and so provide credible candidate sequences for verification by biological experiments. However, methods for identifying SEPs directly are lacking. In this study, a machine learning method to identify SEPs of plant lncRNA (ISPL) is proposed. Hybrid features, including sequence features and physicochemical features, are extracted manually or adaptively to construct different modal features. To keep feature selection stable, a feature selection method that applies non-linear correction in Max-Relevance-Max-Distance (nocRD) is proposed; it integrates multiple feature ranking results and uses an iterative random forest to reduce the dimensionality of the different modal features. Classification models are constructed on the different modal features, and their outputs are combined for ensemble classification. The experimental results show that ISPL reaches an accuracy of 89.86% on the independent test set, which will have important implications for further studies of functional genomics.
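The abstract mentions two steps that a small example can make concrete: integrating multiple feature ranking results, and combining the outputs of per-modality classifiers. The sketch below is illustrative only and is not the ISPL implementation. It assumes mean-rank aggregation for the rankings and soft voting (probability averaging) for the ensemble, neither of which the abstract specifies, and the feature names and probabilities are hypothetical.

```python
from statistics import mean


def aggregate_rankings(rankings):
    """Combine several feature rankings (best first) by mean rank position.

    Assumption: mean-rank aggregation stands in for the nocRD integration
    step, whose exact form is not given in the abstract.
    """
    features = rankings[0]
    mean_rank = {f: mean(r.index(f) for r in rankings) for f in features}
    # Sort by mean rank; ties fall back to the feature name for determinism.
    return sorted(features, key=lambda f: (mean_rank[f], f))


def soft_vote(probabilities, threshold=0.5):
    """Average per-model positive-class probabilities and threshold them.

    Assumption: soft voting is one common way to combine model outputs.
    """
    avg = mean(probabilities)
    return avg, avg >= threshold


# Hypothetical rankings of the same three features from three ranking methods.
rankings = [
    ["kmer_ATG", "gc_content", "hydrophobicity"],
    ["gc_content", "kmer_ATG", "hydrophobicity"],
    ["kmer_ATG", "hydrophobicity", "gc_content"],
]
selected = aggregate_rankings(rankings)[:2]  # keep the two best-ranked features

# Hypothetical SEP probabilities from three per-modality classifiers.
score, is_sep = soft_vote([0.9, 0.6, 0.3])
```

Here `selected` keeps the features with the lowest average rank, and `is_sep` is the ensemble decision for a single candidate sequence. In practice the rankings and probabilities would come from trained models rather than hard-coded lists.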