Department of Mathematics and Statistics, American University, Washington, DC 20016
Horticultural Sciences Department, University of Florida, Gainesville, Florida 32611.
Genetics. 2018 Nov;210(3):789-807. doi: 10.1534/genetics.118.301468. Epub 2018 Sep 5.
Detecting and quantifying the differences in individual genomes (i.e., genotyping) plays a fundamental role in most modern bioinformatics pipelines. Many scientists now use reduced representation next-generation sequencing (NGS) approaches for genotyping. Genotyping diploid individuals using NGS is a well-studied field, and similar methods for polyploid individuals are just emerging. However, there are many aspects of NGS data, particularly in polyploids, that remain unexplored by most methods. Our contributions in this paper are fourfold: (i) We draw attention to, and then model, common aspects of NGS data: sequencing error, allelic bias, overdispersion, and outlying observations. (ii) Many datasets feature related individuals, and so we use the structure of Mendelian segregation to build an empirical Bayes approach for genotyping polyploid individuals. (iii) We develop novel models to account for preferential pairing of chromosomes, and harness these for genotyping. (iv) We derive oracle genotyping error rates that may be used for read depth suggestions. We assess the accuracy of our method in simulations, and apply it to a dataset of hexaploid sweet potato (Ipomoea batatas). An R package implementing our method is available at https://cran.r-project.org/package=updog.
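To make contribution (ii) concrete, the following is a minimal sketch of how Mendelian segregation can act as a prior when genotyping the offspring of two polyploid parents. It is not the updog implementation or its API: the function names are hypothetical, the parental dosages are assumed known, and the read model is a plain binomial with an assumed sequencing error rate. Allelic bias, overdispersion, outliers, preferential pairing, and double reduction are all deliberately left out.

```python
from math import comb


def gamete_dist(parent_dosage, ploidy):
    """P(gamete carries j reference alleles), j = 0..ploidy/2.

    Assumes a gamete is ploidy/2 chromosomes drawn without replacement
    from the parent (hypergeometric), with no preferential pairing.
    """
    half = ploidy // 2
    total = comb(ploidy, half)
    return [
        comb(parent_dosage, j) * comb(ploidy - parent_dosage, half - j) / total
        for j in range(half + 1)
    ]


def offspring_prior(dosage1, dosage2, ploidy):
    """Prior over offspring dosage 0..ploidy: convolve the two gamete distributions."""
    g1 = gamete_dist(dosage1, ploidy)
    g2 = gamete_dist(dosage2, ploidy)
    prior = [0.0] * (ploidy + 1)
    for i, a in enumerate(g1):
        for j, b in enumerate(g2):
            prior[i + j] += a * b
    return prior


def read_likelihood(ref_reads, total_reads, dosage, ploidy, seq_error=0.005):
    """Binomial likelihood of the reference read count.

    seq_error is an assumed per-read miscall rate (illustrative value only).
    """
    p = dosage / ploidy
    p_obs = p * (1 - seq_error) + (1 - p) * seq_error
    return (
        comb(total_reads, ref_reads)
        * p_obs ** ref_reads
        * (1 - p_obs) ** (total_reads - ref_reads)
    )


def genotype_posterior(ref_reads, total_reads, dosage1, dosage2, ploidy,
                       seq_error=0.005):
    """Posterior over offspring dosage: segregation prior times read likelihood."""
    prior = offspring_prior(dosage1, dosage2, ploidy)
    joint = [
        prior[k] * read_likelihood(ref_reads, total_reads, k, ploidy, seq_error)
        for k in range(ploidy + 1)
    ]
    norm = sum(joint)
    return [x / norm for x in joint]


if __name__ == "__main__":
    # Tetraploid cross, each parent with one reference copy.
    print(offspring_prior(1, 1, 4))  # [0.25, 0.5, 0.25, 0.0, 0.0]
    # Hexaploid offspring with 30 reference reads out of 60.
    print(genotype_posterior(30, 60, 3, 3, 6))
```

The prior rules out dosages that the cross cannot produce (dosages 3 and 4 in the tetraploid example), which is the sense in which family structure can sharpen genotype calls at a given read depth.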