School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China.
Comput Biol Med. 2023 May;157:106773. doi: 10.1016/j.compbiomed.2023.106773. Epub 2023 Mar 11.
Small open reading frames (sORFs) in long noncoding RNA (lncRNA) have recently been shown to encode small peptides, which can help in studying the mechanisms of growth and development in organisms. Because machine learning-based computational methods cost less than biological experiments, they can be used to identify sORFs and provide a basis for those experiments. However, few computational methods and data resources have been developed for identifying sORFs in plant lncRNA. In addition, machine learning models yield underperforming classifiers when faced with class imbalance. In this study, a SMOTE variant based on weighted cosine distance (WCDSMOTE), which can interact with feature selection, is put forward to synthesize minority class samples, and weighted edited nearest neighbor (WENN) is applied to clean up majority class samples. Together these form a hybrid sampling method, WCDSMOTE-ENN, for imbalanced datasets with multi-angle features. A heterogeneous classifier ensemble then performs the classification task. The result is sORFplnc, a novel computational method based on class-imbalance learning that identifies sORFs with coding potential in plant lncRNA. Experimental results show that sORFplnc outperforms existing computational methods in identifying sORFs with coding potential. We anticipate that the proposed work can serve as a reference for related research and contribute to agriculture and biomedicine.
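The hybrid sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact form of the weighted cosine distance, how the feature weights are derived from feature selection, or how WENN weights its neighbors. The sketch assumes a cosine distance with per-feature non-negative weights, standard SMOTE interpolation toward one of the k nearest minority neighbors, and the standard edited-nearest-neighbor rule (drop a majority sample when most of its k nearest neighbors carry a different label). All function names and parameters are hypothetical.

```python
import math
import random


def weighted_cosine_distance(a, b, w):
    """1 - weighted cosine similarity; w holds assumed non-negative feature weights."""
    dot = sum(wi * x * y for wi, x, y in zip(w, a, b))
    norm_a = math.sqrt(sum(wi * x * x for wi, x in zip(w, a)))
    norm_b = math.sqrt(sum(wi * y * y for wi, y in zip(w, b)))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def smote_weighted_cosine(minority, weights, n_new, k=3, seed=0):
    """SMOTE-style oversampling: interpolate between a minority sample and
    one of its k nearest minority neighbours under the weighted distance.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: weighted_cosine_distance(base, p, weights),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + gap * (y - x) for x, y in zip(base, target)])
    return synthetic


def edited_nearest_neighbours(samples, labels, weights, majority_label, k=3):
    """ENN-style cleaning: drop a majority sample when fewer than half of
    its k nearest neighbours share its label. Minority samples are kept."""
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y == majority_label:
            nearest = sorted(
                (j for j in range(len(samples)) if j != i),
                key=lambda j: weighted_cosine_distance(x, samples[j], weights),
            )[:k]
            agree = sum(1 for j in nearest if labels[j] == y)
            if agree * 2 < len(nearest):
                continue  # most neighbours disagree: treat as noise
        kept.append((x, y))
    return kept
```

In a pipeline of this shape, the oversampling step would be applied to the minority class first and the cleaning step to the combined set afterwards; the resampled data would then be passed to the classifier ensemble. The order and the choice of k here are assumptions, not details taken from the paper.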