
How to infer reliable diploid genotypes from NGS or traditional sequence data: from basic probability to experimental optimization.

Affiliations

IMBE (Mediterranean Institute of marine and continental Biodiversity and Ecology), CNRS, UMR 7263, Université Aix-Marseille, Institut Pythéas, France.

Publication information

J Evol Biol. 2012 May;25(5):949-60. doi: 10.1111/j.1420-9101.2012.02488.x. Epub 2012 Mar 16.

Abstract

The use of diploid sequence markers is still challenging despite the good quality of the information they provide. There is a problem common to all sequencing approaches [traditional cloning and sequencing of PCR amplicons as well as next-generation sequencing (NGS)]: when no variation is found within the sequences from a given individual, homozygosity can never be asserted with certainty. As a consequence, sequence data from diploid markers are mostly analysed at the population (not the individual) level, particularly in animal studies. This study aims to contribute to solving this problem. Using Bayes' theorem and the binomial law, useful results are derived, among which: (i) the number of sequence reads per individual (or sequencing depth) required to ensure, at a given probability threshold, that heterozygotes are not erroneously considered homozygotes, as a function of the observed heterozygosity (H(o)) of the locus in the population; (ii) a way of estimating H(o) from low-coverage NGS data; (iii) a way of testing the null hypothesis that a genetic marker corresponds to a single, diploid locus, in the absence of data from controlled crosses; (iv) strategies for characterizing sequence genotypes in populations while minimizing the average number of sequence reads per individual; (v) a rationale for deciding which variations along the sequence need to be considered, as a function of the affordable sequencing depth, the desired level of polymorphism and the risk of sequencing error. For traditional sequencing technology, optimal strategies appear surprisingly different from the usual empirical ones. The average number of sequence reads required to obtain 99% of fully determined genotypes never exceeds six, this value corresponding to the worst situation, when H(o) equals 0.6. This threshold value of H(o) is strikingly stable when the tolerated proportion of non-fully resolved genotypes varies within a reasonable range.
These results do not rely on the Hardy-Weinberg equilibrium assumption or on diallelism of nucleotidic sites.
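Point (i) above can be illustrated with a simple fixed-depth version of the Bayes/binomial calculation. The sketch below is not the paper's exact derivation (which also covers adaptive strategies that lower the average depth): it assumes each read of a heterozygote samples either allele with probability 1/2, ignores sequencing error, and uses H(o) as the prior probability that the individual is heterozygous; the function names are ours.

```python
def p_het_given_uniform(n, ho):
    """Posterior probability (Bayes' theorem) that an individual is
    heterozygous given that all n of its reads show the same allele.
    ho is the observed heterozygosity of the locus, used as the prior.
    A heterozygote yields n identical reads with binomial probability
    2 * (1/2)**n = 2**(1 - n); a homozygote does so with probability 1."""
    p_uniform_if_het = 2.0 ** (1 - n)
    return ho * p_uniform_if_het / (ho * p_uniform_if_het + (1 - ho))

def min_depth(ho, alpha=0.01):
    """Smallest sequencing depth n at which a run of n identical reads
    leaves a probability below alpha that the individual is in fact
    heterozygous, i.e. the depth needed to call homozygotes safely."""
    n = 1
    while p_het_given_uniform(n, ho) >= alpha:
        n += 1
    return n

# Example: at the worst-case heterozygosity H(o) = 0.6, a fixed-depth
# design needs more reads than the average of six quoted in the
# abstract, because the paper's optimized strategies stop sequencing
# heterozygotes as soon as both alleles have been seen.
print(min_depth(0.6, alpha=0.01))
print(min_depth(0.1, alpha=0.01))
```

Note that the posterior decreases geometrically with depth, so each extra read roughly halves the residual risk of miscalling a heterozygote as a homozygote.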

