Department of Statistics, Indiana University, Bloomington, IN, USA.
Bioinformatics. 2017 Jul 15;33(14):i350-i358. doi: 10.1093/bioinformatics/btx265.
Relating gene expression measurements to phenotypic responses is hampered by both computational and statistical impediments. Conventional statistical methods are less than ideal because they fail to select relevant genes, predict poorly, ignore the unknown interaction structure between genes, or are computationally intractable. New methods are therefore needed that can handle many expression measurements on relatively few patients while also uncovering gene-gene relationships and predicting well.
We develop a new technique for using the marginal relationship between gene expression measurements and patient survival outcomes to identify a small subset of genes which appear highly relevant for predicting survival, produce a low-dimensional embedding based on this small subset, and amplify this embedding with information from the remaining genes. We motivate our methodology by using gene expression measurements to predict survival time for patients with diffuse large B-cell lymphoma, illustrate the behavior of our methodology on carefully constructed synthetic examples, and test it on a number of other gene expression datasets. Our technique is computationally tractable, generally outperforms other methods, is extensible to other phenotypes, and also identifies different genes (relative to existing methods) for possible future study.
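The three steps described above (marginal screening, a low-dimensional embedding on the screened genes, and amplification with the remaining genes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an uncensored continuous response, absolute marginal correlation as the screening score, an SVD of the screened submatrix for the embedding, and a Nyström-style extension for the remaining genes. The function name `screened_embedding_regression` and the parameters `n_keep` and `n_components` are hypothetical.

```python
import numpy as np


def screened_embedding_regression(X, y, n_keep=20, n_components=2):
    """Illustrative sketch only (assumed details, not the published method).

    X: (n_patients, n_genes) expression matrix.
    y: (n_patients,) continuous response, e.g. log survival time with
       no censoring (an assumption made to keep the example short).
    """
    # Center the columns and the response.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Step 1: marginal screening by absolute correlation with the response.
    scale = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    scores = np.abs(Xc.T @ yc) / np.where(scale > 0, scale, 1.0)
    keep = np.argsort(scores)[::-1][:n_keep]
    rest = np.setdiff1d(np.arange(X.shape[1]), keep)

    # Step 2: low-dimensional embedding from the screened genes only.
    X1, X2 = Xc[:, keep], Xc[:, rest]
    _, s, Vt = np.linalg.svd(X1, full_matrices=False)
    V1 = Vt[:n_components].T            # loadings on screened genes
    s = s[:n_components]                # assumed strictly positive

    # Step 3: Nystrom-style extension of the loadings to the remaining
    # genes, so that they also contribute to the embedding.
    V2 = X2.T @ (X1 @ V1) / s**2

    loadings = np.zeros((X.shape[1], n_components))
    loadings[keep] = V1
    loadings[rest] = V2

    # Regress the response on the amplified embedding.
    Z = Xc @ loadings
    coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    return keep, loadings, Z, coef
```

Handling censored survival times would require replacing the least-squares step with a survival model such as Cox regression, which this sketch deliberately omits.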
All of the code and data are available at http://mypage.iu.edu/~dajmcdon/research/.
Supplementary material is available at Bioinformatics online.