Comparing variant calling algorithms for target-exon sequencing in a large sample.

Author information

Lo Yancy, Kang Hyun M, Nelson Matthew R, Othman Mohammad I, Chissoe Stephanie L, Ehm Margaret G, Abecasis Gonçalo R, Zöllner Sebastian

Affiliations

Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI, 48109, USA.

GlaxoSmithKline, Quantitative Sciences, Research Triangle Park, NC, USA.

Publication information

BMC Bioinformatics. 2015 Mar 7;16:75. doi: 10.1186/s12859-015-0489-0.

Abstract

BACKGROUND

Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing.
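The concordance measure described above can be illustrated with a minimal sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for a missing call, and that sites missing in either call set are excluded from the comparison; the function name `genotype_concordance` and this coding are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional, Sequence


def genotype_concordance(
    calls_a: Sequence[Optional[int]],
    calls_b: Sequence[Optional[int]],
) -> Optional[float]:
    """Return the fraction of sites with identical genotypes.

    Both inputs list genotypes for the same sites in the same order,
    for example sequencing calls versus array genotypes, or two
    sequencing replicates of one individual. Sites where either call
    is missing (None) are skipped (an assumption of this sketch).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("call sets must cover the same sites")

    compared = 0
    matches = 0
    for g_a, g_b in zip(calls_a, calls_b):
        if g_a is None or g_b is None:
            continue  # not comparable at this site
        compared += 1
        if g_a == g_b:
            matches += 1

    # No comparable sites: concordance is undefined.
    return matches / compared if compared else None


# Hypothetical example data, not taken from the study.
sequencing = [0, 1, 2, None]
array = [0, 1, 1, 2]
concordance = genotype_concordance(sequencing, array)
```

In this toy example three sites are comparable and two agree, so the concordance is 2/3. A real analysis might also report concordance separately by genotype class, since agreement at homozygous-reference sites dominates an overall rate.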

RESULTS

Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals.
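The two summary statistics reported above, the share of singletons among variants and the missing-genotype rate, can be sketched as follows. The sketch assumes a site-by-individual matrix of alternate-allele counts with `None` for missing calls, and treats a singleton as a site with exactly one alternate allele copy across all called genotypes; that definition and the function name `summarize_calls` are assumptions for illustration, not taken from the paper.

```python
from typing import Dict, List, Optional


def summarize_calls(genotypes: List[List[Optional[int]]]) -> Dict[str, float]:
    """Summarize a call set given as rows = sites, columns = individuals."""
    total_genotypes = 0
    missing_genotypes = 0
    variant_sites = 0
    singleton_sites = 0

    for site in genotypes:
        called = [g for g in site if g is not None]
        total_genotypes += len(site)
        missing_genotypes += len(site) - len(called)

        alt_copies = sum(called)  # alternate allele copies at this site
        if alt_copies > 0:
            variant_sites += 1
        if alt_copies == 1:
            singleton_sites += 1  # assumed singleton definition

    return {
        "missing_rate": missing_genotypes / total_genotypes if total_genotypes else 0.0,
        "variant_sites": variant_sites,
        "singleton_fraction": singleton_sites / variant_sites if variant_sites else 0.0,
    }


# Hypothetical toy matrix: 3 sites, 4 individuals.
toy = [
    [0, 1, 0, None],  # one alternate copy: singleton, one missing call
    [0, 1, 1, 0],     # two alternate copies: variant, not a singleton
    [0, 0, 0, 0],     # monomorphic
]
summary = summarize_calls(toy)
```

For the toy matrix, one of twelve genotypes is missing and one of the two variant sites is a singleton. Comparing these summaries across call sets from different callers mirrors the kind of comparison reported in the results.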

CONCLUSIONS

We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b901/4359451/7bf4017daf62/12859_2015_489_Fig1_HTML.jpg
