Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes.

Authors

Källberg David, Vidman Linda, Rydén Patrik

Affiliations

Department of Statistics, USBE, Umeå University, Umeå, Sweden.

Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden.

Publication information

Front Genet. 2021 Feb 24;12:632620. doi: 10.3389/fgene.2021.632620. eCollection 2021.

Abstract

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
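
The evaluation described in the abstract rests on two building blocks: ranking genes by a selection statistic and scoring a clustering against the gold standard with the ARI. The following is a minimal sketch of both, using only the Python standard library. It is not the authors' code: the function names `adjusted_rand_index` and `select_top_sd_genes`, the dictionary layout of the expression data, and the toy values are all assumptions made for illustration. The SD ranking is shown only because it is the simplest of the compared approaches, and the abstract reports that it did not perform well in the study.

```python
from collections import Counter
from math import comb
from statistics import pstdev


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions trivially agree

    # Pairs of samples placed together in both partitions (contingency cells).
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())

    # Pairs placed together within each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)


def select_top_sd_genes(expression, k):
    """Return the k gene names with the highest SD across samples.

    `expression` is a hypothetical layout: gene name -> list of expression
    values, one per sample.
    """
    ranked = sorted(expression, key=lambda g: pstdev(expression[g]), reverse=True)
    return ranked[:k]


# Toy data (invented for illustration): three genes measured in four samples.
toy_expression = {
    "geneA": [1.0, 1.0, 1.0, 1.0],    # constant, carries no information
    "geneB": [0.0, 10.0, 0.0, 10.0],  # highly variable
    "geneC": [4.0, 5.0, 4.0, 5.0],    # mildly variable
}
selected = select_top_sd_genes(toy_expression, k=2)

# ARI ignores label names: a relabelled but identical partition scores 1.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every true cluster scores below 0.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI near 0 corresponds to agreement at chance level and 1 to perfect agreement, which is why the reported increase from (-0.01, 0.39) to (0.66, 0.72) counts as a substantial gain. Swapping the SD ranking for another statistic only requires changing the `key` function in `select_top_sd_genes`.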
