Frost H Robert, Amos Christopher I
Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA.
Nucleic Acids Res. 2017 Jul 7;45(12):e114. doi: 10.1093/nar/gkx291.
Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.
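The mapping described in the abstract, in which continuous gene-level activity is regressed on all gene sets jointly under a LASSO penalty, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the predictors are 0/1 gene set membership indicators, that no intercept is fitted, and that the penalty weight is fixed by hand rather than tuned; the abstract does not specify these details. The set names, the toy data, and the helper functions `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and the solver is a plain cyclic coordinate descent written for readability.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200, tol=1e-10):
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic coordinate descent.

    X is a list of rows (genes) by columns (gene sets); no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef, because coef starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # empty gene set: its coefficient stays at zero
            # Covariance of column j with the partial residual
            # (column j's own contribution is added back in).
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * coef[j]) for i in range(n)
            ) / n
            updated = soft_threshold(rho, penalty) / col_scale[j]
            delta = updated - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = updated
            largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return coef


# Hypothetical toy collection: 8 genes, 3 overlapping gene sets.
set_names = ["SET_A", "SET_B", "SET_C"]
membership = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],  # genes 2 and 3 belong to both SET_A and SET_B
    [1, 1, 0],
    [0, 1, 1],  # genes 4 and 5 belong to both SET_B and SET_C
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]
# Continuous gene-level statistics: only the SET_A genes show activity.
gene_stats = [2.0, 2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]

coef = lasso_coordinate_descent(membership, gene_stats, penalty=0.1)
selected = [name for name, b in zip(set_names, coef) if b != 0.0]
print(selected)
```

On this toy input only SET_A keeps a nonzero coefficient (about 1.8, shrunk from the unpenalized value of 2), while the overlapping SET_B is dropped even though it shares two active genes with SET_A. This is the parsimony property the abstract attributes to multiset methods. In practice the penalty weight would normally be tuned, for example by cross-validation, which this sketch does not attempt.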