Cai Binghuang, Li Biao, Kiga Nikki, Thusberg Janita, Bergquist Timothy, Chen Yun-Ching, Niknafs Noushin, Carter Hannah, Tokheim Collin, Beleva-Guthrie Violeta, Douville Christopher, Bhattacharya Rohit, Yeo Hui Ting Grace, Fan Jean, Sengupta Sohini, Kim Dewey, Cline Melissa, Turner Tychele, Diekhans Mark, Zaucha Jan, Pal Lipika R, Cao Chen, Yu Chen-Hsin, Yin Yizhou, Carraro Marco, Giollo Manuel, Ferrari Carlo, Leonardi Emanuela, Tosatto Silvio C E, Bobe Jason, Ball Madeleine, Hoskins Roger A, Repo Susanna, Church George, Brenner Steven E, Moult John, Gough Julian, Stanke Mario, Karchin Rachel, Mooney Sean D
Department of Biomedical Informatics & Medical Education, University of Washington School of Medicine, Seattle, Washington.
The Buck Institute for Research on Aging, Novato, California.
Hum Mutat. 2017 Sep;38(9):1266-1276. doi: 10.1002/humu.23265. Epub 2017 Jun 19.
The advent of next-generation sequencing has dramatically decreased the cost of whole-genome sequencing and increased the viability of its application in research and clinical care. The Personal Genome Project (PGP) provides unrestricted access to the genomes of individuals and their associated phenotypes. This resource enabled the Critical Assessment of Genome Interpretation (CAGI) to create a community challenge to assess the bioinformatics community's ability to predict traits from whole genomes. In the CAGI PGP challenge, researchers were asked to predict whether an individual had a particular trait or profile based on their whole genome. Several approaches were used to assess submissions, including ROC AUC (area under the receiver operating characteristic curve), probability rankings, the number of correct predictions, and statistical significance simulations. Overall, we found that prediction of individual traits is difficult and relies on strong knowledge of trait frequency within the general population, whereas matching genomes to trait profiles relies heavily on a small number of common traits, including ancestry, blood type, and eye color. When a rare genetic disorder is present, profiles can be matched when one or more pathogenic variants are identified. Prediction accuracy has improved substantially over the last 6 years due to improved methodology and a better understanding of features.
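The abstract names ROC AUC and statistical significance simulations among the assessment approaches but does not describe how they were implemented. The following is a minimal sketch of how a single binary-trait submission might be scored, assuming one probability per individual and a 0/1 ground-truth label. The function names, the toy data, and the choice of label permutation as the simulation are illustrative assumptions, not the challenge's actual assessment code.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive
    outranks a randomly chosen negative (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_permutations=1000, seed=0):
    """Hypothetical significance simulation: shuffle the labels and count
    how often a shuffled AUC is at least as high as the observed one."""
    rng = random.Random(seed)
    observed = roc_auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if roc_auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data (invented for illustration): 1 = individual has the trait.
    labels = [1, 0, 1, 0, 0, 1, 0, 0]
    scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.35, 0.3, 0.5]
    print("AUC:", roc_auc(labels, scores))
    print("p-value:", permutation_p_value(labels, scores))
```

The pairwise AUC is quadratic in the number of individuals, which is acceptable for a cohort of this kind; a rank-based computation would be the usual choice for larger data sets.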