Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA.
Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
Bioinformatics. 2022 Jun 24;38(Suppl 1):i378-i385. doi: 10.1093/bioinformatics/btac232.
Modern biological screens yield enormous numbers of measurements, and identifying and interpreting statistically significant associations among features are essential. In experiments featuring multiple high-dimensional datasets collected from the same set of samples, it is useful to identify groups of associated features between the datasets in a way that provides high statistical power and false discovery rate (FDR) control.
Here, we present a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for structured association discovery between paired high-dimensional datasets. HAllA efficiently integrates hierarchical hypothesis testing with FDR correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data. We optimized and evaluated HAllA using heterogeneous synthetic datasets of known association structure, where HAllA outperformed all-against-all and other block-testing approaches across a range of common similarity measures. We then applied HAllA to a series of real-world multiomics datasets, revealing new associations between gene expression and host immune activity, between the microbiome and the host transcriptome, and between metabolomic profiling and human health phenotypes.
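For orientation, the naive all-against-all baseline that HAllA is compared against can be sketched as follows: test every feature pair between the two datasets, then apply a Benjamini-Hochberg correction across all pairs. This is a minimal illustration and not HAllA's hierarchical block-wise procedure; the function names and the p-value matrix are hypothetical, and the p-values are assumed to have been computed beforehand with some pairwise association test.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


def all_against_all_fdr(p_matrix, alpha=0.05):
    """Given p_matrix[i][j], the p-value for feature i of the first dataset
    versus feature j of the second, return the (i, j) pairs that remain
    significant after correction across every pair."""
    pairs = [(i, j) for i, row in enumerate(p_matrix) for j in range(len(row))]
    flat = [p_matrix[i][j] for i, j in pairs]
    rejected = benjamini_hochberg(flat, alpha)
    return [pair for pair, keep in zip(pairs, rejected) if keep]


# Hypothetical p-values for 2 features in one dataset vs 3 in the other.
example_p_matrix = [
    [0.001, 0.20, 0.03],
    [0.04, 0.008, 0.60],
]
significant_pairs = all_against_all_fdr(example_p_matrix, alpha=0.05)
print(significant_pairs)
```

Because the correction here spans every feature pair, the number of tests grows with the product of the two feature counts, which is the cost in statistical power that a block-wise approach aims to reduce.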
An open-source implementation of HAllA is freely available at http://huttenhower.sph.harvard.edu/halla along with documentation, demo datasets and a user group.
Supplementary data are available at Bioinformatics online.