Department of Acute and Tertiary Care, University of Tennessee Health Science Center, Memphis, 38163, USA.
Department of Biostatistics, St Jude Children's Research Hospital, Memphis, 38105, USA.
BMC Bioinformatics. 2021 Apr 21;22(1):207. doi: 10.1186/s12859-021-04110-x.
Identifying sets of related genes (gene sets) that are empirically associated with a treatment or phenotype often yields valuable biological insights. Several methods effectively identify gene sets in which individual genes have simple monotonic relationships with categorical, quantitative, or censored event-time variables. Some distance-based methods, such as distance correlation, may detect complex non-monotonic associations of a gene set with a quantitative variable that elude other methods. However, distance correlation has not yet been generalized to associate gene sets with categorical and censored event-time endpoints. There is also a need to determine which genes empirically drive the significance of an association between a gene set and an endpoint.
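As an illustration of the background idea only, the following is a minimal sketch of the classical sample distance correlation between a gene-set expression matrix and a quantitative endpoint, with a simple permutation p-value. It is not the GSDA implementation. The function names (`centered_distances`, `distance_correlation`, `permutation_pvalue`), the Euclidean distance, and the permutation scheme are assumptions made for this example.

```python
import numpy as np

def centered_distances(values):
    """Double-centered Euclidean distance matrix between the rows of `values`."""
    arr = np.asarray(values, dtype=float)
    if arr.ndim == 1:
        arr = arr[:, None]  # treat a vector as n samples of one feature
    diff = arr[:, None, :] - arr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    return dist - row_mean - col_mean + dist.mean()

def distance_correlation(x, y):
    """Naive (biased) sample distance correlation; rows of x and y are matched subjects."""
    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0  # one of the inputs is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value obtained by shuffling the endpoint across subjects (assumed scheme)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = distance_correlation(x, y)
    exceed = 0
    for _ in range(n_perm):
        shuffled = y[rng.permutation(len(y))]
        if distance_correlation(x, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

For a symmetric, non-monotonic relationship such as an endpoint that depends on the square of expression, the Pearson correlation is expected to be near zero while the distance correlation is not, which is the kind of association the abstract refers to.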
We develop gene-set distance analysis (GSDA) by generalizing distance correlation to evaluate the association of a gene set with categorical and censored event-time variables. We also develop a backward elimination procedure to identify a subset of genes that empirically drive significant associations. In simulation studies, GSDA identified complex non-monotonic gene-set associations more effectively than six other published methods. In the analysis of a pediatric acute myeloid leukemia (AML) data set, GSDA was the only method to discover that event-free survival (EFS) was associated with the 56-gene AML pathway gene set, to narrow that result down to 5 genes, and to confirm the association of those 5 genes with EFS in a separate validation cohort. These results indicate that GSDA effectively identifies and characterizes complex non-monotonic gene-set associations that are missed by other methods.
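The abstract does not describe how the backward elimination works, so the sketch below shows only a generic greedy variant for a quantitative endpoint. The helper names (`dcor`, `backward_eliminate`), the stopping rule, and the use of the naive distance correlation as the score are all assumptions for illustration. The published procedure may use a different statistic and criterion, and this sketch does not cover categorical or censored endpoints.

```python
import numpy as np

def dcor(x, y):
    """Naive sample distance correlation between row-matched arrays."""
    def centered(values):
        arr = np.asarray(values, dtype=float)
        arr = arr.reshape(len(arr), -1)  # one row per subject
        diff = arr[:, None, :] - arr[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        return (dist - dist.mean(axis=0, keepdims=True)
                - dist.mean(axis=1, keepdims=True) + dist.mean())
    a, b = centered(x), centered(y)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom))

def backward_eliminate(expr, endpoint, min_genes=1):
    """Greedily drop one gene (column) at a time while the score does not decrease.

    Returns the kept column indices and the history of (subset, score) pairs.
    This is a hypothetical stopping rule, not the published GSDA criterion.
    """
    expr = np.asarray(expr, dtype=float)
    kept = list(range(expr.shape[1]))
    current = dcor(expr[:, kept], endpoint)
    history = [(tuple(kept), current)]
    while len(kept) > min_genes:
        # Score every subset obtained by leaving exactly one gene out.
        scores = [
            dcor(expr[:, [g for g in kept if g != drop]], endpoint)
            for drop in kept
        ]
        best = int(np.argmax(scores))
        if scores[best] < current:
            break  # every removal weakens the association, so stop
        kept.pop(best)
        current = scores[best]
        history.append((tuple(kept), current))
    return kept, history
```

Because the naive statistic is biased upward in small samples, a practical version would more plausibly compare permutation p-values or a bias-corrected statistic rather than raw scores.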
GSDA is a powerful and flexible method for detecting gene-set associations with categorical, quantitative, or censored event-time variables, and it is especially suited to complex non-monotonic associations. It is available at https://CRAN.R-project.org/package=GSDA.