A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Affiliation

Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA.

Publication information

Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8.

Abstract

MOTIVATION

Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty.

RESULTS

We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.
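To illustrate what working directly from sequencing data without explicit genotyping can look like, the sketch below estimates the non-reference allele frequency at one site from per-sample genotype likelihoods, using a standard expectation-maximization update. It is a minimal sketch under stated assumptions, not the implementation distributed with samtools: the function name, its parameters, the diploid samples, the Hardy-Weinberg prior, and the example likelihood values are all assumptions made for illustration.

```python
from math import comb


def estimate_allele_frequency(genotype_likelihoods, init_freq=0.5,
                              max_iter=100, tol=1e-8):
    """EM estimate of the non-reference allele frequency at a single site.

    genotype_likelihoods: one (L0, L1, L2) tuple per diploid sample, where Lg
    is P(reads | genotype carrying g non-reference alleles).
    All names here are hypothetical and chosen for this illustration.
    """
    n = len(genotype_likelihoods)
    if n == 0:
        raise ValueError("need at least one sample")
    freq = init_freq
    for _ in range(max_iter):
        expected_alt = 0.0
        for likes in genotype_likelihoods:
            # Assumed prior: Hardy-Weinberg genotype frequencies given freq.
            prior = [comb(2, g) * freq ** g * (1 - freq) ** (2 - g)
                     for g in range(3)]
            weights = [like * p for like, p in zip(likes, prior)]
            total = sum(weights)
            if total == 0:
                raise ValueError("sample has zero likelihood for all genotypes")
            # E-step: expected non-reference allele count for this sample.
            expected_alt += (weights[1] + 2 * weights[2]) / total
        # M-step: new frequency is the expected count over 2n chromosomes.
        new_freq = expected_alt / (2 * n)
        if abs(new_freq - freq) < tol:
            return new_freq
        freq = new_freq
    return freq


if __name__ == "__main__":
    # Low-coverage style input: no sample has a confidently known genotype.
    uncertain = [(0.9, 0.1, 0.0), (0.5, 0.5, 0.0), (0.1, 0.8, 0.1)]
    print(estimate_allele_frequency(uncertain))
```

No genotype is ever called: each sample contributes its posterior expected allele count, so uncertain samples are weighted by how informative their reads are.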

AVAILABILITY

http://samtools.sourceforge.net.

CONTACT

hengli@broadinstitute.org.
