School of Public Health, University of Alberta, AB, Canada.
Public Health Dynamics Laboratory, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, PA, USA.
Comput Biol Med. 2019 Oct;113:103389. doi: 10.1016/j.compbiomed.2019.103389. Epub 2019 Aug 17.
Gene set analysis is a popular approach to examine the association between a predefined gene set and a phenotype. A few methods have been developed for a continuous phenotype; however, often not all the genes within a significant gene set contribute to its significance, and no gene set reduction method has been developed for a continuous phenotype. We developed a computationally efficient analytical tool, called the linear combination test for gene set reduction (LCT-GSR), to identify core subsets of gene sets associated with a continuous phenotype. Identifying the core subset enhances our understanding of the biological mechanism and reduces the costs of disease risk assessment, diagnosis and treatment.
We evaluated the performance of our analytical tool by applying it to two real microarray studies. In the first application, we analyzed pathway expression measurements in newborns' blood to discover core genes contributing to the variation in birth weight. On average, we were able to reduce the number of genes in the 33 significant gene sets of embryonic stem cell signatures by 84.3%, resulting in 229 unique genes. Using immunologic signatures, on average we reduced the number of genes in the 210 significant gene sets by 89%, leading to 1603 unique genes. There were 180 unique core genes overlapping across the two databases. In the second application, we analyzed pathway expression measurements in a cohort of lethal prostate cancer patients from the Swedish Watchful Waiting cohort to identify the main genes associated with tumor volume. On average, we were able to reduce the number of genes in the 17 gene sets by 90%, resulting in 47 unique genes.
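The summary figures reported above (average percent reduction per gene set, number of unique core genes, and overlap between two gene set databases) can be illustrated with a minimal sketch. It does not implement LCT-GSR itself; it only shows how such summaries could be computed once core subsets are available. The function name `summarize_reduction`, the gene set names, and the gene identifiers are all hypothetical toy values, not data from the study.

```python
def summarize_reduction(original_sets, core_sets):
    """Return (mean percent reduction across gene sets, set of unique core genes).

    original_sets and core_sets map a gene set name to its list of genes.
    The core subsets are assumed to have been produced by some reduction method.
    """
    reductions = []
    unique_core = set()
    for name, genes in original_sets.items():
        core = core_sets[name]
        # Percent of genes removed from this gene set by the reduction step
        reductions.append(100.0 * (len(genes) - len(core)) / len(genes))
        unique_core |= set(core)
    return sum(reductions) / len(reductions), unique_core


# Toy "database A": two made-up gene sets and their made-up core subsets
db_a_original = {
    "SET_A1": ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"],
    "SET_A2": ["g3", "g4", "g11", "g12", "g13"],
}
db_a_core = {"SET_A1": ["g3", "g7"], "SET_A2": ["g3"]}

# Toy "database B": one made-up gene set and its made-up core subset
db_b_original = {"SET_B1": ["g7", "g20", "g21", "g22"]}
db_b_core = {"SET_B1": ["g7", "g20"]}

mean_a, unique_a = summarize_reduction(db_a_original, db_a_core)
mean_b, unique_b = summarize_reduction(db_b_original, db_b_core)

# Core genes that appear in both databases
overlap = unique_a & unique_b

print(f"Database A: mean reduction {mean_a:.1f}%, {len(unique_a)} unique core genes")
print(f"Database B: mean reduction {mean_b:.1f}%, {len(unique_b)} unique core genes")
print(f"Overlapping core genes: {sorted(overlap)}")
```

Note that the average is taken over gene sets, so a gene shared by several sets counts once toward the unique total but contributes to each set's own reduction percentage.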
We conclude that LCT-GSR is a statistically sound analytical tool that can be used to extract core genes associated with a continuous phenotype. It can be applied to a wide range of studies in which dichotomizing the continuous phenotype is neither easy nor meaningful. Reduction to the most predictive genes is crucial in advancing our understanding of issues such as disease prevention, faster and more efficient diagnosis, intervention strategies and personalized medicine.