Seok Junhee, Davis Ronald W, Xiao Wenzhong
School of Electrical Engineering, Korea University, Seoul 136-713, Republic of Korea.
Stanford Genome Technology Center, Palo Alto, California, United States of America.
PLoS One. 2015 May 1;10(5):e0122103. doi: 10.1371/journal.pone.0122103. eCollection 2015.
Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analysis of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest has remained in finding modules of biological knowledge, or the corresponding gene sets, that are significantly associated with disease conditions. Risk prediction from censored survival times using gene sets has not been well studied. In this work, we propose a hybrid method that uses single-gene and gene-set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Conversely, single genes supplement the information missing from gene sets, which are incomplete because biomedical knowledge is imperfect. In tests on multiple cancer and trauma injury data sets, the proposed method showed robust and improved performance compared with conventional approaches that use only single genes or only gene sets. Additionally, we examined the prediction results for the trauma injury data and showed that the modules of biological knowledge used in prediction by the proposed method were highly interpretable biologically. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge.
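The abstract does not specify how single-gene and gene-set information are combined, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes a gene-set score can be taken as the mean z-score of the set's member genes (an illustrative choice), appends those scores to the per-gene features, and includes Harrell's concordance index as a common way to evaluate risk predictions against censored survival times. The names `hybrid_features`, `concordance_index`, and the toy data are hypothetical.

```python
import numpy as np

def zscore(expr):
    """Standardize each gene (column) to zero mean and unit variance."""
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    return (expr - mu) / sd

def hybrid_features(expr, gene_names, gene_sets):
    """Concatenate per-gene features with one summary score per gene set.

    ASSUMPTION: the set score is the mean z-score of its member genes.
    The abstract does not state which summary the authors use.
    """
    z = zscore(expr)
    index = {g: i for i, g in enumerate(gene_names)}
    scores, labels = [], list(gene_names)
    for name, members in gene_sets.items():
        cols = [index[g] for g in members if g in index]
        if not cols:
            continue  # skip sets with no measured member genes
        scores.append(z[:, cols].mean(axis=1))
        labels.append("SET:" + name)
    if scores:
        z = np.column_stack([z] + scores)
    return z, labels

def concordance_index(time, event, risk):
    """Harrell's C for right-censored data.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's observed time; it is concordant when
    patient i also has the higher predicted risk. Ties in risk count 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the earlier member
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Toy data (hypothetical): 4 patients, 3 genes, 1 gene set.
expr = np.array([[1.0, 2.0, 3.0],
                 [2.0, 1.0, 0.0],
                 [3.0, 3.0, 6.0],
                 [4.0, 2.0, 3.0]])
genes = ["G1", "G2", "G3"]
sets = {"toy_pathway": ["G1", "G3", "G_missing"]}

X, labels = hybrid_features(expr, genes, sets)
print(X.shape, labels)
```

The resulting matrix `X` could then be passed to any penalized survival model; which model the authors use is not stated in the abstract, so that step is left out here.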