Liu Yongzhuang, Li Bingshan, Tan Renjie, Zhu Xiaolin, Wang Yadong
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, Center for Human Genome Variation, Duke University, Durham, NC 27708 and Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN 37235, USA.
Bioinformatics. 2014 Jul 1;30(13):1830-6. doi: 10.1093/bioinformatics/btu141. Epub 2014 Mar 10.
Whole-genome and -exome sequencing on parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge.
In this article, we curated 59 sequence features from the whole-genome and exome alignment context that are considered relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set from experimentally validated true and false de novo mutations, together with false de novo mutations collected from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated the relative importance of different sequence features to classification accuracy. Finally, we applied DNMFilter to our in-house whole-exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filter that significantly reduces the false discovery rate without sacrificing sensitivity.
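To make the filtering idea concrete, the following is a minimal sketch of gradient boosting used as a true/false de novo mutation filter. It is not the DNMFilter implementation: it uses only three hypothetical features (child alternate-allele fraction, parental alternate read count, mean mapping quality) instead of the 59 curated ones, synthetic training data, decision stumps as weak learners, and mean-residual leaf values rather than any particular library's update rule. All function names and the 0.5 probability cutoff are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: no usable split
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]

def train(X, y, rounds=50, lr=0.3):
    """Gradient boosting on log-loss: each stump fits the residuals y - p."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1.0 - p))  # initial score: log-odds of the true class
    scores = [f0] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        scores = [s + lr * (lv if row[j] <= t else rv) for row, s in zip(X, scores)]
    return f0, lr, stumps

def predict_proba(model, row):
    """Probability that a candidate is a true de novo mutation."""
    f0, lr, stumps = model
    return sigmoid(f0 + sum(lr * (lv if row[j] <= t else rv) for j, t, lv, rv in stumps))

def fdr_and_sensitivity(labels, calls):
    """labels: 1 = validated true DNM, 0 = false; calls: True = kept by the filter."""
    tp = sum(1 for l, c in zip(labels, calls) if l == 1 and c)
    fp = sum(1 for l, c in zip(labels, calls) if l == 0 and c)
    fn = sum(1 for l, c in zip(labels, calls) if l == 1 and not c)
    fdr = fp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return fdr, sensitivity

rng = random.Random(0)

def make_candidate(is_true):
    # Synthetic, hypothetical features: [child alt fraction, parent alt reads, mean MAPQ]
    if is_true:
        return [rng.uniform(0.35, 0.65), 0.0, rng.uniform(50, 60)]
    return [rng.uniform(0.05, 0.25), float(rng.randint(1, 4)), rng.uniform(20, 60)]

y_train = [1] * 40 + [0] * 40
X_train = [make_candidate(label == 1) for label in y_train]
model = train(X_train, y_train)

y_test = [1] * 10 + [0] * 10
X_test = [make_candidate(label == 1) for label in y_test]
calls = [predict_proba(model, row) > 0.5 for row in X_test]
print("FDR, sensitivity:", fdr_and_sensitivity(y_test, calls))
```

In practice the candidates would come from an upstream de novo caller, and the classifier would only decide which of those calls to keep. That is why the false discovery rate and sensitivity are computed over the filter's keep/discard decisions.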
DNMFilter, implemented in a combination of Java and R, is freely available at http://humangenome.duke.edu/software.