Department of Statistics, University of California, Berkeley, CA 94720, USA.
Proc Natl Acad Sci U S A. 2010 Apr 13;107(15):6823-8. doi: 10.1073/pnas.0912043107. Epub 2010 Apr 1.
The rapid accumulation of gene expression data offers unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases, yet this resource remains far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Specifically, we developed a systematic framework, including a two-stage Bayesian learning approach, that diagnoses one or more diseases for a query expression profile along a hierarchical disease taxonomy. Our approach standardizes cross-platform gene expression data and heterogeneous disease annotations, so that both sources of information can be analyzed in a unified probabilistic system. Cross-validation showed a high level of overall diagnostic accuracy. We also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories. Finally, we showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.