A joint use of pooling and imputation for genotyping SNPs.

Author information

Division of Scientific Computing, Department of Information Technology, Uppsala University, Lägerhyddsvägen 1, hus 10, 75237, Uppsala, Sweden.

Publication information

BMC Bioinformatics. 2022 Oct 13;23(1):421. doi: 10.1186/s12859-022-04974-7.

DOI:10.1186/s12859-022-04974-7

PMID:36229780

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9563787/

Abstract

BACKGROUND

Despite continuing technological advances, the cost of large-scale genotyping of many samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest using pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of greatest importance for non-model organisms, for which cost-efficient large-scale chips and high-quality reference genomes are more limited, for example in wildlife monitoring and in plant and animal breeding, but the approach is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is smaller than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker by marker the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance on pooled data of the Beagle algorithm and of a local likelihood-aware phasing algorithm, closely modeled on MaCH, that we implemented.
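The pooling and decoding idea described above can be illustrated with a minimal sketch. It is not the authors' estimation algorithm. It assumes a hypothetical 4 x 4 block of 16 samples pooled by row and by column (8 pools), a noise-free pool readout with three possible values (0 if every sample in the pool is homozygous reference, 2 if every sample is homozygous alternate, 1 otherwise), and a simple rule that marks a sample as undetermined (-1) when neither of its two pools is homogeneous. The function names `pool_outcome`, `pool_block`, and `decode_block` are invented for illustration.

```python
import numpy as np

def pool_outcome(genotypes):
    """Assumed noise-free readout of one pool at one marker (0, 1, or 2)."""
    g = np.asarray(genotypes)
    if np.all(g == 0):
        return 0  # only the reference allele is present in the pool
    if np.all(g == 2):
        return 2  # only the alternate allele is present in the pool
    return 1      # both alleles are present in the pool

def pool_block(block):
    """Pool a 2-D block of genotypes by row and by column."""
    block = np.asarray(block)
    rows = [pool_outcome(block[i, :]) for i in range(block.shape[0])]
    cols = [pool_outcome(block[:, j]) for j in range(block.shape[1])]
    return rows, cols

def decode_block(rows, cols):
    """Recover the genotypes that the pool outcomes determine directly.

    A sample in a homogeneous pool (outcome 0 or 2) must carry that
    genotype. Samples whose row and column pools are both mixed stay
    undetermined (-1) and would be left to a later estimation step.
    """
    decoded = np.full((len(rows), len(cols)), -1, dtype=int)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            for outcome in (r, c):
                if outcome in (0, 2):
                    decoded[i, j] = outcome
    return decoded

# Hypothetical example: one heterozygous carrier among 16 samples.
block = np.zeros((4, 4), dtype=int)
block[1, 2] = 1
rows, cols = pool_block(block)
decoded = decode_block(rows, cols)
```

In this toy case, 8 pool assays stand in for 16 individual assays, and only the sample at the intersection of the two mixed pools is left undetermined. With several carriers in a block, more samples become ambiguous, which is where a probabilistic estimate of the genotypes followed by imputation would come in.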

RESULTS

We conduct simulations based on human data from the 1000 Genomes Project to aid comparison with other imputation studies. Based on the simulated data, we find that pooling, before any imputation, alters the genotype frequencies of the directly identifiable markers. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data and thus outperforms the Beagle imputation model, which reaches 80% concordance. We observe that the pooling design gives higher concordance for rare variants than the traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.
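For readers unfamiliar with the metric, genotype concordance can be sketched as the fraction of genotype calls that match the true genotypes. The snippet below is a minimal illustration under that assumption; the abstract does not spell out the exact definition used in the study, and the function name `concordance` and the example values are hypothetical.

```python
import numpy as np

def concordance(true_genotypes, predicted_genotypes):
    """Fraction of positions where predicted and true genotypes agree."""
    true_arr = np.asarray(true_genotypes)
    pred_arr = np.asarray(predicted_genotypes)
    if true_arr.shape != pred_arr.shape:
        raise ValueError("genotype arrays must have the same shape")
    return float(np.mean(true_arr == pred_arr))

# Hypothetical genotypes (0, 1, 2 = number of alternate alleles).
score = concordance([0, 1, 2, 0], [0, 1, 1, 0])  # 3 of 4 calls match
```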

CONCLUSIONS

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find applications in any context where genotyping costs limit the study size, such as marker-assisted selection in plant breeding.
