Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA.
Department of Neuroscience, Johns Hopkins University, Baltimore, MD, 21205, USA.
Biostatistics. 2022 Oct 14;23(4):1200-1217. doi: 10.1093/biostatistics/kxac005.
Integrative analysis of multiple data sets has the potential to fully leverage the vast amount of high-throughput biological data being generated. In particular, such analysis will be powerful for making inferences from publicly available collections of genetic, transcriptomic, and epigenetic data sets that are designed to study shared biological processes but that vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established, and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.