Federal University of Santa Catarina, Joinville, Santa Catarina, Brazil.
IZKF Computational Biology Research Group, Institute for Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany.
Methods. 2018 Jan 1;132:42-49. doi: 10.1016/j.ymeth.2017.07.023. Epub 2017 Aug 2.
RNA-Seq is becoming the standard technology for large-scale gene expression measurement, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy compared with those for microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated for microarrays. For the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines for practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-seq datasets. Our evaluation considers strategies regarding expression estimates, the number of genes retained after non-specific filtering, and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures that are commonly used for gene expression analysis. Results support that clustering cancer samples based on gene quantification should be preferred. Non-specific filtering that leads to a small number of features (1,000) gives, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithm, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
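The recommended steps (log transformation, non-specific filtering to a small number of genes, then Average-Linkage clustering of samples) can be illustrated with a minimal sketch. This is not the authors' implementation. Several choices here are assumptions made only for illustration: the log2(x + 1) pseudocount, variance as the filtering criterion, applying the log transform before filtering, and the use of a plain Pearson correlation distance in place of the Symmetric Rank-Magnitude correlation named in the abstract. All function names are hypothetical, and the toy matrix is invented example data.

```python
import math


def log_transform(matrix):
    """Apply log2(x + 1) to each value (the pseudocount of 1 is an assumption)."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]


def top_variance_genes(matrix, n_genes):
    """Non-specific filter: keep the n_genes rows (genes) with the highest variance.

    Variance is an assumed criterion; the abstract does not name one.
    """
    def variance(row):
        mean = sum(row) / len(row)
        return sum((v - mean) ** 2 for v in row) / len(row)

    ranked = sorted(range(len(matrix)), key=lambda g: variance(matrix[g]), reverse=True)
    keep = sorted(ranked[:n_genes])  # preserve the original gene order
    return [matrix[g] for g in keep]


def pearson_distance(x, y):
    """Return 1 - Pearson correlation (a stand-in, not Symmetric Rank-Magnitude)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # treat a constant profile as uncorrelated
    return 1.0 - cov / (sx * sy)


def average_linkage(samples, k):
    """Agglomerative clustering with average linkage, stopped at k clusters."""
    n = len(samples)
    dist = [[pearson_distance(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def cluster_samples(counts, n_genes=1000, k=2):
    """counts: genes x samples matrix of expression estimates."""
    logged = log_transform(counts)
    filtered = top_variance_genes(logged, n_genes)
    samples = [list(col) for col in zip(*filtered)]  # transpose to samples x genes
    return average_linkage(samples, k)


# Toy data (invented): 4 genes x 4 samples, two sample groups.
counts = [
    [100, 120, 5, 8],
    [4, 6, 90, 110],
    [50, 52, 49, 51],  # nearly constant gene, dropped by the filter
    [200, 180, 10, 12],
]
labels = cluster_samples(counts, n_genes=3, k=2)
print(labels)
```

On real data, `n_genes` would be set to about 1,000 as the abstract suggests, and the number of clusters `k` would come from the study design. The pairwise search above is quadratic per merge and is written for readability, not for large sample sets.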