MSPJ: Discovering potential biomarkers in small gene expression datasets via ensemble learning.

Authors

Yin HuaChun, Tao JingXin, Peng Yuyang, Xiong Ying, Li Bo, Li Song, Yang Hui

Affiliations

Department of Neurosurgery, Xinqiao Hospital, The Army Medical University, Chongqing 400037, China.

College of Life Sciences, Chongqing Normal University, Chongqing 401331, China.

Publication information

Comput Struct Biotechnol J. 2022 Jul 14;20:3783-3795. doi: 10.1016/j.csbj.2022.07.022. eCollection 2022.

Abstract

In transcriptomics, differentially expressed genes (DEGs) provide fine-grained phenotypic resolution for comparisons between groups and offer insights into the molecular mechanisms underlying the pathogenesis of complex diseases or phenotypes. The robust detection of DEGs from large datasets is well established. However, owing to various limitations (e.g., the low availability of samples for some diseases or limited research funding), experiments frequently use small sample sizes. Methods that screen reliable and stable features are therefore urgently needed for analyses with limited sample sizes. In this study, MSPJ, a new machine learning approach for identifying DEGs, was proposed to mitigate the reduced power and improve the stability of DEG identification in small gene expression datasets. This ensemble learning-based method consists of three algorithms: an improved multiple random sampling with meta-analysis, SVM-RFE (support vector machines-recursive feature elimination), and a permutation test. MSPJ was compared with ten classical methods on 94 simulated datasets and in large-scale benchmarking on 165 real datasets. The results showed that, among these methods, MSPJ had the best performance in most small gene expression datasets, especially those with sample sizes below 30. In summary, the MSPJ method enables effective feature selection for robust DEG identification in small transcriptome datasets and is expected to expand research on the molecular mechanisms underlying complex diseases or phenotypes.
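To make one of the three components named in the abstract concrete, the following is a minimal sketch of a two-sided permutation test for a single gene measured in two small groups. It is not the authors' implementation: the function name `permutation_test`, its parameters, the choice of the difference in means as the test statistic, and the expression values are all assumptions made for illustration. The abstract does not say how MSPJ combines this test with multiple random sampling and SVM-RFE, so that step is not shown.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Returns an empirical p-value. The +1 in numerator and denominator counts
    the observed labelling itself, so the p-value is never exactly zero.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to randomly
        # reassigning the group labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values for one gene, five samples per group.
case = [8.1, 7.9, 8.4, 8.0, 8.3]
control = [6.2, 6.5, 6.1, 6.4, 6.0]
p_shifted = permutation_test(case, control)

# Hypothetical gene with identical values in both groups (no difference).
p_null = permutation_test([5.0, 5.2, 4.9, 5.1], [5.1, 4.9, 5.0, 5.2])

print(f"shifted gene: p = {p_shifted:.4f}")
print(f"null gene:    p = {p_null:.4f}")
```

With only five samples per group there are just 252 distinct ways to split the pooled values, so the p-value cannot fall much below 2/252 however strong the effect is. This floor illustrates the reduced power in small datasets that the abstract refers to.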

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58df/9304602/48a2d4739219/ga1.jpg
