Malde Ketil
BMC Genomics. 2014;15 Suppl 6(Suppl 6):S20. doi: 10.1186/1471-2164-15-S6-S20. Epub 2014 Oct 17.
High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently used on a large scale across biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part because of artifacts and biases inherent in the sequencing process. Variants that are diagnostic are commonly selected using diversity statistics like FST, but these measures are not ideal for the task.
Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms.
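To make the notion of expected information gain concrete, the following is a minimal sketch, not the estimator developed in the paper. It assumes a biallelic site, two populations with known allele frequencies, and a prior probability that a sample comes from the first population; under those assumptions the expected information from observing one allele is the mutual information between population label and allele. The function names `entropy` and `expected_information` and the parameter `prior_a` are hypothetical, and the sketch ignores the sampling bias and sequencing error that the conservative estimator is meant to handle.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information(freq_a, freq_b, prior_a=0.5):
    """Expected bits gained about population membership from one observed allele.

    freq_a, freq_b: frequency of the reference allele in populations A and B
                    (assumed known exactly, which real data never guarantees).
    prior_a:        assumed prior probability that the sample is from A.
    """
    prior_b = 1.0 - prior_a
    # Marginal probability of seeing the reference allele.
    p_allele = prior_a * freq_a + prior_b * freq_b
    h_marginal = entropy([p_allele, 1.0 - p_allele])
    # Average allele entropy once the population is known.
    h_conditional = (prior_a * entropy([freq_a, 1.0 - freq_a])
                     + prior_b * entropy([freq_b, 1.0 - freq_b]))
    # Mutual information I(population; allele).
    return h_marginal - h_conditional
```

Under these assumptions, a site fixed for different alleles in the two populations yields one full bit per observation, while a site with identical frequencies in both populations yields none.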
The expected information content gives an easy-to-interpret measure of the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.