Identification of small open reading frames in plant lncRNA using class-imbalance learning.

Affiliations

School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China.

Publication information

Comput Biol Med. 2023 May;157:106773. doi: 10.1016/j.compbiomed.2023.106773. Epub 2023 Mar 11.

Abstract

Recently, small open reading frames (sORFs) in long noncoding RNA (lncRNA) have been shown to encode small peptides, which can help in studying the mechanisms of growth and development in organisms. Because machine learning-based computational methods cost less than biological experiments, they can be used to identify sORFs and provide a basis for experimental work. However, few computational methods and data resources have been developed for identifying sORFs in plant lncRNA. In addition, machine learning models yield underperforming classifiers when faced with class imbalance. In this study, a SMOTE variant based on weighted cosine distance (WCDSMOTE), which can interact with feature selection, is proposed to synthesize minority-class samples, and weighted edited nearest neighbor (WENN) is applied to clean up majority-class samples. Together they form the hybrid sampling method WCDSMOTE-ENN, which handles imbalanced datasets with multi-angle features. A heterogeneous classifier ensemble then performs the classification. On this basis, a novel computational method based on class-imbalance learning is presented for identifying sORFs with coding potential in plant lncRNA (sORFplnc). Experimental results show that sORFplnc outperforms existing computational methods in identifying sORFs with coding potential. We anticipate that the proposed work can serve as a reference for related research and contribute to agriculture and biomedicine.
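
The abstract does not give the WCDSMOTE formulation, so the following is only a minimal sketch of the general idea it names: SMOTE-style interpolation in which nearest neighbors are chosen by a weighted cosine distance. The function names, the per-feature weight vector `w` (imagined here as coming from a feature-selection step), and the parameters `k` and `seed` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def weighted_cosine_distance(a, b, w):
    """Cosine distance with each feature scaled by a non-negative weight w.

    Hypothetical helper: the paper's exact distance is not given in the abstract.
    """
    num = np.sum(w * a * b)
    den = np.sqrt(np.sum(w * a * a)) * np.sqrt(np.sum(w * b * b))
    if den == 0.0:
        return 1.0  # treat zero-norm vectors as maximally dissimilar
    return 1.0 - num / den


def smote_weighted_cosine(X_min, w, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Neighbors are ranked by weighted cosine distance instead of the
    Euclidean distance used in standard SMOTE.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise weighted cosine distances between minority samples.
    D = np.array([[weighted_cosine_distance(X_min[i], X_min[j], w)
                   for j in range(n)] for i in range(n)])
    np.fill_diagonal(D, np.inf)  # a sample is never its own neighbor
    k = min(k, n - 1)
    neighbors = np.argsort(D, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                # random minority sample
        j = neighbors[i, rng.integers(k)]  # one of its k nearest neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic


# Toy minority class with three features; the weights are illustrative only.
X_min = np.array([[1.0, 0.2, 0.0],
                  [0.9, 0.3, 0.1],
                  [0.1, 1.0, 0.8],
                  [0.2, 0.9, 1.0]])
w = np.array([0.5, 0.3, 0.2])
X_new = smote_weighted_cosine(X_min, w, n_new=6, k=2)
```

Each synthetic row lies on the line segment between a minority sample and one of its neighbors, so it stays within the per-feature range of the original minority data. A cleaning step such as edited nearest neighbor would then be applied separately to the majority class.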
