Department of Mathematics, Stanford University, Stanford, CA 94305, USA.
Proc Natl Acad Sci U S A. 2011 Apr 26;108(17):7265-70. doi: 10.1073/pnas.1102826108. Epub 2011 Apr 11.
High-throughput biological data, whether generated by sequencing, transcriptional microarrays, proteomics, or other means, continue to require analytic methods that address their high-dimensional aspects. Because the computational part of data analysis ultimately identifies shape characteristics in the organization of data sets, the mathematics of shape recognition in high dimensions remains a crucial part of data analysis. This article introduces a method that extracts information from high-throughput microarray data and, by using topology, provides greater depth of information than current analytic techniques. The method, termed Progression Analysis of Disease (PAD), first identifies robust aspects of cluster analysis, then goes deeper to find a multitude of biologically meaningful shape characteristics in these data. Additionally, because PAD incorporates a visualization tool, it provides a simple picture or graph that can be used to explore the data further. Although PAD can be applied to a wide range of high-throughput data types, it is applied here, as an example, to breast cancer transcriptional data. The analysis identified a unique subgroup of Estrogen Receptor-positive (ER(+)) breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes. These patients exhibit 100% survival and no metastasis. No supervised step beyond the distinction between tumor and healthy patients was used to identify this subtype. The group has a clear, distinct, and statistically significant molecular signature. It highlights coherent biology but is invisible to cluster methods, and it does not fit into the accepted classification of ER(+) breast cancers into Luminal A/B and Normal-like subtypes. We denote the group as c-MYB(+) breast cancer.