Balsor Justin L, Arbabi Keon, Singh Desmond, Kwan Rachel, Zaslavsky Jonathan, Jeyanesan Ewalina, Murphy Kathryn M
McMaster Neuroscience Graduate Program, McMaster University, Hamilton, ON, Canada.
Department of Psychology, Neuroscience and Behavior, McMaster University, Hamilton, ON, Canada.
Front Neurosci. 2021 Nov 16;15:668293. doi: 10.3389/fnins.2021.668293. eCollection 2021.
Studying the molecular development of the human brain presents unique challenges for selecting a data analysis approach. The rare and valuable nature of human postmortem brain tissue, especially for developmental studies, means that sample sizes are small (n), yet high-throughput genomic and proteomic methods measure the expression levels of hundreds or thousands of variables [e.g., genes or proteins (p)] for each sample. This leads to a data structure that is high dimensional (p ≫ n) and introduces the curse of dimensionality, which poses a challenge for traditional statistical approaches. In contrast, high dimensional analyses, especially cluster analyses developed for sparse data, have worked well for analyzing genomic datasets where p ≫ n. Here we explore applying a lasso-based clustering method developed for high dimensional genomic data with small sample sizes. Using protein and gene data from the developing human visual cortex, we compared clustering methods. We identified an application of sparse k-means clustering [robust sparse k-means clustering (RSKC)] that partitioned samples into age-related clusters that reflect lifespan stages from birth to aging. RSKC adaptively selects a subset of the genes or proteins contributing to partitioning samples into age-related clusters that progress across the lifespan. This approach addresses a problem in current studies that could not identify multiple postnatal clusters. Moreover, clusters encompassed a range of ages like a series of overlapping waves, illustrating that chronological and brain age have a complex relationship. In addition, a recently developed workflow to create plasticity phenotypes (Balsor et al., 2020) was applied to the clusters and revealed neurobiologically relevant features that identified how the human visual cortex changes across the lifespan.
These methods can help address the growing demand for multimodal integration, from molecular machinery to brain imaging signals, to understand the human brain's development.
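The lasso-based clustering idea described in the abstract can be illustrated with a minimal sketch. It assumes plain sparse k-means (feature weights constrained by an L1 bound and updated by soft-thresholding each feature's between-cluster sum of squares) and omits the trimming step that makes RSKC robust to outliers. The function names, the synthetic data, and the bound `l1_bound=3.0` are illustrative assumptions; this is not the authors' code or the RSKC package.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def between_cluster_ss(X, labels):
    """Per-feature between-cluster sum of squares (total minus within)."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return total - within


def update_weights(bcss, l1_bound):
    """Soft-threshold bcss so that ||w||_2 = 1 and ||w||_1 <= l1_bound."""
    assert l1_bound > 1.0, "the L1 bound must exceed 1 for a solution to exist"

    def normalized(delta):
        w = np.maximum(bcss - delta, 0.0)
        return w / np.linalg.norm(w)

    w = normalized(0.0)
    if w.sum() <= l1_bound:
        return w
    lo, hi = 0.0, bcss.max()
    for _ in range(50):  # bisection on the soft-threshold level
        mid = (lo + hi) / 2
        if normalized(mid).sum() > l1_bound:
            lo = mid
        else:
            hi = mid
    return normalized(hi)


def sparse_kmeans(X, k, l1_bound, n_outer=10, seed=0):
    """Alternate between weighted k-means and the feature-weight update."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))  # start from equal feature weights
    for _ in range(n_outer):
        labels = kmeans(X * np.sqrt(w), k, rng)
        w = update_weights(between_cluster_ss(X, labels), l1_bound)
    return labels, w


# Synthetic p >> n example (assumed data): 30 samples, 200 features,
# of which only the first 10 carry group structure.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 200))
X[:, :10] += 3.0 * truth[:, None]

labels, w = sparse_kmeans(X, k=3, l1_bound=3.0)
print("features with non-zero weight:", np.flatnonzero(w))
print("cluster labels:", labels)
```

Lowering `l1_bound` toward 1 forces more feature weights to exactly zero, which is the mechanism by which a sparse clustering method selects a subset of genes or proteins when p ≫ n.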