Pal Lipika R, Kundu Kunal, Yin Yizhou, Moult John
Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland.
Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, Maryland.
Hum Mutat. 2017 Sep;38(9):1225-1234. doi: 10.1002/humu.23256. Epub 2017 Jun 28.
Understanding the basis of complex trait disease is a fundamental problem in human genetics. The CAGI Crohn's Exome challenges provide insight into the adequacy of current disease models by requiring participants to identify, from exome data, which individuals in a set have been diagnosed with the disease. For the CAGI4 round, we developed a method that uses only the genotypes from exome sequencing data to impute the status of genome-wide association study (GWAS) marker SNPs. We then used the imputed genotypes as input to several machine learning methods that had been trained to predict disease status from marker SNP information. We achieved the best performance with Naïve Bayes and with a consensus machine learning method, obtaining an area under the curve (AUC) of 0.72, higher than that of the other methods used in CAGI4. We also developed a model that incorporated the contribution of rare missense variants in the exome data, but it performed less well. Future progress is expected to come from the use of whole-genome data rather than exomes.
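The classification step described in the abstract (predicting disease status from marker SNP genotypes with Naïve Bayes and scoring the result by AUC) can be illustrated with a minimal sketch. This is not the authors' code: the imputation step is omitted, the genotypes are synthetic, the allele frequencies and sample sizes are arbitrary assumptions, and all function names (`simulate_genotypes`, `train_naive_bayes`, `case_log_odds`, `auc`) are hypothetical.

```python
import math
import random

def simulate_genotypes(labels, control_freq, case_freq, n_snps, rng):
    """Draw synthetic genotypes (0, 1 or 2 risk alleles per SNP). Illustrative only."""
    data = []
    for y in labels:
        freq = case_freq if y == 1 else control_freq
        # Two independent allele draws per SNP; bool + bool gives 0, 1 or 2.
        data.append([(rng.random() < freq) + (rng.random() < freq)
                     for _ in range(n_snps)])
    return data

def train_naive_bayes(genotypes, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model on genotypes coded 0, 1, 2."""
    n_snps = len(genotypes[0])
    model = {"prior": {}, "cond": {}}
    for cls in (0, 1):
        rows = [g for g, y in zip(genotypes, labels) if y == cls]
        model["prior"][cls] = len(rows) / len(genotypes)
        # Laplace-smoothed P(genotype | class) for each SNP.
        model["cond"][cls] = [
            [(sum(1 for r in rows if r[j] == g) + alpha) / (len(rows) + 3 * alpha)
             for g in (0, 1, 2)]
            for j in range(n_snps)
        ]
    return model

def case_log_odds(model, genotype):
    """Log odds of case versus control, treating SNPs as independent."""
    score = math.log(model["prior"][1]) - math.log(model["prior"][0])
    for j, g in enumerate(genotype):
        score += math.log(model["cond"][1][j][g]) - math.log(model["cond"][0][j][g])
    return score

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Assumed toy setup: 20 SNPs, 300 training and 300 held-out individuals,
# balanced between cases (1) and controls (0).
rng = random.Random(0)
labels_train = [i % 2 for i in range(300)]
labels_test = [i % 2 for i in range(300)]
train = simulate_genotypes(labels_train, 0.30, 0.45, 20, rng)
test = simulate_genotypes(labels_test, 0.30, 0.45, 20, rng)

model = train_naive_bayes(train, labels_train)
scores = [case_log_odds(model, g) for g in test]
test_auc = auc(scores, labels_test)
print(f"AUC on held-out synthetic individuals: {test_auc:.2f}")
```

The AUC printed here reflects only the simulated effect sizes and has no bearing on the 0.72 reported for the CAGI4 data.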