Prioritization Of Nonsynonymous Single Nucleotide Variants For Exome Sequencing Studies Via Integrative Learning On Multiple Genomic Data.

Authors

Wu Mengmeng, Wu Jiaxin, Chen Ting, Jiang Rui

Affiliations

MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China.

Department of Computer Science, Tsinghua University, China.

Publication

Sci Rep. 2015 Oct 13;5:14955. doi: 10.1038/srep14955.

DOI:10.1038/srep14955

PMID:26459872

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4602202/

Abstract

The rapid advancement of next-generation sequencing technology has greatly accelerated progress in understanding human inherited diseases through such innovations as exome sequencing. Nevertheless, identifying causative variants from sequencing data remains a great challenge. Traditional statistical genetics approaches such as linkage analysis and association studies have limited power in analyzing exome sequencing data, while relying simply on filtration strategies and the predicted functional implications of mutations to pinpoint pathogenic variants is prone to producing false positives. To overcome these limitations, we herein propose a supervised learning approach, termed snvForest, to prioritize candidate nonsynonymous single nucleotide variants for a specific type of disease by integrating 11 functional scores at the variant level and 8 association scores at the gene level. We conduct a series of large-scale in silico validation experiments, demonstrating the effectiveness of snvForest across 2,511 diseases of different inheritance modes and the superiority of our approach over two state-of-the-art methods. We further apply snvForest to three real exome sequencing data sets of epileptic encephalopathies and intellectual disability to show the ability of our approach to identify causative de novo mutations for these complex diseases. The online service and standalone software of snvForest are available at http://bioinfo.au.tsinghua.edu.cn/jianglab/snvforest.
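
The integration step described in the abstract can be pictured roughly as follows. This is only a minimal sketch, not the snvForest implementation: the abstract gives the feature counts (11 variant-level functional scores and 8 gene-level association scores) but not the learner's details, so the class `Candidate`, the functions `placeholder_model` and `prioritize`, the variant labels, and all score values are hypothetical. The plain mean used for scoring is a stand-in for a trained supervised classifier and was chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

N_VARIANT_SCORES = 11  # functional scores at the variant level (count from the abstract)
N_GENE_SCORES = 8      # association scores at the gene level (count from the abstract)


@dataclass
class Candidate:
    """A candidate nonsynonymous SNV with its two groups of scores (hypothetical layout)."""
    name: str
    variant_scores: Sequence[float]
    gene_scores: Sequence[float]

    def features(self) -> List[float]:
        """Concatenate both score groups into one 19-dimensional feature vector."""
        if len(self.variant_scores) != N_VARIANT_SCORES:
            raise ValueError(f"expected {N_VARIANT_SCORES} variant-level scores")
        if len(self.gene_scores) != N_GENE_SCORES:
            raise ValueError(f"expected {N_GENE_SCORES} gene-level scores")
        return list(self.variant_scores) + list(self.gene_scores)


def placeholder_model(features: Sequence[float]) -> float:
    """Assumption: stands in for a trained classifier; returns the plain mean."""
    return sum(features) / len(features)


def prioritize(
    candidates: Sequence[Candidate],
    model: Callable[[Sequence[float]], float] = placeholder_model,
) -> List[str]:
    """Return candidate names ordered from highest to lowest model score."""
    scored = [(model(c.features()), c.name) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Made-up scores for three hypothetical candidates.
candidates = [
    Candidate("snv_a", [0.9] * 11, [0.8] * 8),
    Candidate("snv_b", [0.2] * 11, [0.1] * 8),
    Candidate("snv_c", [0.5] * 11, [0.9] * 8),
]
ranking = prioritize(candidates)
print(ranking)
```

Passing a different callable as `model` (for example, the probability output of a classifier trained on known disease-causing and neutral variants) would leave the ranking logic unchanged.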