Division of Rheumatology, Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati OH, USA ; Medical Scientist Training Program, University of Cincinnati College of Medicine, Cincinnati OH, USA.
Division of Rheumatology, Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati OH, USA ; Department of Veterans Affairs, Veterans Affairs Medical Center - Cincinnati, Cincinnati OH, USA.
Front Genet. 2014 Feb 12;5:16. doi: 10.3389/fgene.2014.00016. eCollection 2014.
Next-generation sequencing studies generate a large quantity of genetic data in a relatively cost- and time-efficient manner and provide an unprecedented opportunity to identify candidate causative variants that lead to disease phenotypes. A challenge for these studies is the generation of sequencing artifacts by current technologies. To identify and characterize the properties that distinguish false positive variants from true variants, we sequenced a child and both parents (one trio) using DNA isolated from three sources (blood, buccal cells, and saliva). The trio strategy allowed us to identify variants in the proband that could not have been inherited from the parents (Mendelian errors) and that most likely indicate sequencing artifacts. Of the quality control measurements examined, three identified the greatest number of Mendelian errors: read depth, genotype quality score, and alternate allele ratio. Filtering the variants on these measurements removed ~95% of the Mendelian errors while retaining 80% of the called variants. These filters were applied independently. After filtering, the concordance between identical samples isolated from different sources was 99.99%, compared with 87% before filtering. This high concordance suggests that different sources of DNA can be used in trio studies without affecting the ability to identify causative polymorphisms. To facilitate analysis of next-generation sequencing data, we developed the Cincinnati Analytical Suite for Sequencing Informatics (CASSI) to store sequencing files and metadata (e.g., relatedness information), manage file versioning, filter data, annotate variants, and identify candidate causative polymorphisms that follow de novo, rare recessive homozygous, or compound heterozygous inheritance models.
We conclude that the data cleaning process improves the signal-to-noise ratio among called variants and facilitates the identification of candidate disease-causing polymorphisms.
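As an illustration of the approach summarized above, the following is a minimal Python sketch of two steps: flagging a Mendelian error in a trio and evaluating a call against the three quality measurements (read depth, genotype quality score, alternate allele ratio). It is not the CASSI implementation. The `Call` record, the function names, and every threshold value are hypothetical and chosen only for illustration; the abstract does not state the cutoffs used. Each filter is reported separately so that the filters can be examined one at a time or combined.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Diploid genotype as a pair of allele indices: 0 = reference, 1 = alternate.
Genotype = Tuple[int, int]


@dataclass
class Call:
    """Hypothetical per-sample variant call (not a CASSI data structure)."""
    genotype: Genotype
    depth: int      # total reads covering the site
    gq: int         # genotype quality score
    alt_reads: int  # reads supporting the alternate allele


def is_mendelian_error(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child's genotype cannot be formed from one allele per parent."""
    a, b = child
    consistent = (a in mother and b in father) or (a in father and b in mother)
    return not consistent


def alt_allele_ratio(call: Call) -> float:
    """Fraction of reads supporting the alternate allele (0.0 if no coverage)."""
    return call.alt_reads / call.depth if call.depth else 0.0


def filter_results(
    call: Call,
    min_depth: int = 10,                          # assumed threshold
    min_gq: int = 20,                             # assumed threshold
    het_ratio: Tuple[float, float] = (0.3, 0.7),  # assumed window for heterozygotes
    hom_alt_min_ratio: float = 0.9,               # assumed minimum for homozygous alt
) -> Dict[str, bool]:
    """Evaluate each of the three filters separately and return pass/fail per filter."""
    ratio = alt_allele_ratio(call)
    alt_count = sum(1 for allele in call.genotype if allele != 0)
    if alt_count == 1:
        ratio_ok = het_ratio[0] <= ratio <= het_ratio[1]
    elif alt_count == 2:
        ratio_ok = ratio >= hom_alt_min_ratio
    else:
        # Homozygous reference: no alternate-ratio requirement in this sketch.
        ratio_ok = True
    return {
        "depth": call.depth >= min_depth,
        "genotype_quality": call.gq >= min_gq,
        "alt_allele_ratio": ratio_ok,
    }


def passes_all(call: Call) -> bool:
    """True only if the call passes every filter under the default thresholds."""
    return all(filter_results(call).values())
```

For example, a heterozygous proband call whose parents are both homozygous reference is flagged by `is_mendelian_error((0, 1), (0, 0), (0, 0))`, and `filter_results` shows which of the three measurements, if any, would have removed that call.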