Ye Shuyun, Dawson John A, Kendziorski Christina
Department of Statistics, University of Wisconsin, Madison, WI, USA.
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA.
Cancer Inform. 2015 Feb 10;13(Suppl 7):85-95. doi: 10.4137/CIN.S16354. eCollection 2014.
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual's disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a "document" with "text" detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
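The "patient as document" representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patient identifiers, feature tokens, and survival values below are hypothetical placeholders, and the sketch covers only the construction of a bag-of-words count matrix paired with a time-to-event response, which is the kind of input a supervised topic model would consume. It does not fit survLDA itself.

```python
from collections import Counter

# Hypothetical records: every identifier, token, and number is illustrative only.
patients = {
    "patient_A": {"clinical": ["stage_III", "therapy_X"],
                  "genomic": ["GENE1_mut", "GENE2_high", "GENE2_high"]},
    "patient_B": {"clinical": ["stage_IV", "therapy_X"],
                  "genomic": ["GENE1_mut", "GENE3_low"]},
    "patient_C": {"clinical": ["stage_III"],
                  "genomic": ["GENE2_high", "GENE3_low"]},
}

# Hypothetical time-to-event response: (follow-up time in months, event observed?).
responses = {
    "patient_A": (14.0, True),
    "patient_B": (30.5, False),  # censored observation
    "patient_C": (22.0, True),
}

def to_document(record):
    """Treat a patient's clinical and genomic tokens as the 'text' of one document."""
    return record["clinical"] + record["genomic"]

def build_vocabulary(documents):
    """Return the sorted list of distinct tokens across all documents."""
    return sorted({token for doc in documents for token in doc})

def count_matrix(documents, vocabulary):
    """Build a document-by-token count matrix (one row per patient)."""
    index = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token, n in Counter(doc).items():
            row[index[token]] = n
        rows.append(row)
    return rows

patient_ids = sorted(patients)
documents = [to_document(patients[pid]) for pid in patient_ids]
vocabulary = build_vocabulary(documents)
counts = count_matrix(documents, vocabulary)
survival = [responses[pid] for pid in patient_ids]

for pid, row, (time, event) in zip(patient_ids, counts, survival):
    print(pid, row, time, event)
```

Each row of `counts` plays the role of a document's word counts in a standard LDA setting, and the matching entry of `survival` is the response that a survival-supervised model would use to guide the topics.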