Allele frequency calibration for SNP based genotyping of DNA pools: A regression based local-global error fusion method.

Author information

Rahman Ashfaqur, Hellicar Andrew, Smith Daniel, Henshall John M

Affiliation

Digital Productivity Flagship, CSIRO, Hobart, Tasmania, Australia.

Publication information

Comput Biol Med. 2015 Jun;61:48-55. doi: 10.1016/j.compbiomed.2015.03.020. Epub 2015 Mar 26.

DOI:10.1016/j.compbiomed.2015.03.020

PMID:25863000

Abstract

BACKGROUND

The cost of developing high-density microarray technologies is prohibitive for genotyping animals when a single animal has low economic value (e.g. prawns). DNA pooling addresses this by combining multiple DNA samples prior to genotyping: instead of genotyping each individual's DNA sample, a mixture of DNA samples from the individuals (i.e. the pool) is genotyped only once, which greatly reduces the cost of genotyping. However, pooled samples are subject to greater genotyping inaccuracies than individual samples, and incorrect genotyping leads to incorrect biological conclusions. The resulting genotypes (allele frequencies) therefore need to be calibrated.

METHODS

We present a regression-based approach to translate raw array output into allele frequency. During training, a small number of pools and the individuals that constitute them are genotyped. Given the genotypes of the individuals that constitute a pool, we compute the pool's true allele frequency. We then train a regression algorithm to learn a mapping from the raw array outputs to the true allele frequency. We test the algorithm on pool samples withheld from the training set. During prediction, we use this mapping to genotype pools with no prior knowledge of the individuals constituting them.
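The training and prediction steps described above can be illustrated with a minimal sketch. It is not the authors' local-global fusion method: an ordinary least-squares line stands in for the regression algorithm, each individual is assumed to contribute an equal amount of DNA to its pool, and genotypes are assumed to be coded as counts (0, 1 or 2) of one allele. All function names and the toy numbers are hypothetical.

```python
from statistics import mean


def pool_allele_frequency(genotypes):
    """True allele frequency of a pool from its members' genotypes.

    Assumption: each genotype is the count (0, 1 or 2) of the tracked
    allele, and every individual contributes equally to the pool.
    """
    return sum(genotypes) / (2 * len(genotypes))


def fit_line(raw, true_freq):
    """Least-squares line mapping raw array output to true frequency.

    A simple stand-in for the regression algorithm; returns (slope, intercept).
    """
    mx, my = mean(raw), mean(true_freq)
    sxx = sum((x - mx) ** 2 for x in raw)
    sxy = sum((x - mx) * (y - my) for x, y in zip(raw, true_freq))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, raw_value):
    """Calibrated allele frequency for a new pool, clipped to [0, 1]."""
    slope, intercept = model
    return min(1.0, max(0.0, slope * raw_value + intercept))


# Toy training data (illustrative only, not from the study):
# four pools with known member genotypes and one raw array value each.
train_genotypes = [[0, 0, 1, 1], [0, 1, 1, 2], [1, 1, 2, 2], [2, 2, 2, 1]]
train_raw = [0.30, 0.50, 0.70, 0.80]

train_true = [pool_allele_frequency(g) for g in train_genotypes]
model = fit_line(train_raw, train_true)

# Prediction for a pool whose members were never genotyped individually.
calibrated = predict(model, 0.60)
```

In practice the fitted mapping would be evaluated on pools withheld from training, as the abstract describes, before being applied to new pools.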

RESULTS AND DISCUSSION

After data quality control, the available dataset comprises 912 pools. We estimate allele frequency using three approaches: the raw data, a commonly used piecewise linear transformation, and the proposed local-global learner fusion method. The resulting RMS errors for the three approaches are 0.135, 0.120, and 0.080, respectively.
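For reference, the RMS error used to compare the approaches is the square root of the mean squared difference between estimated and true allele frequencies. A minimal sketch follows; the function name and the toy values are hypothetical and are not data from the study.

```python
import math


def rms_error(estimated, true):
    """Root-mean-square error between estimated and true allele frequencies."""
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example (illustrative values only): two pools, each estimate off by 0.1.
example_error = rms_error([0.5, 0.5], [0.4, 0.6])
```

A lower value means the estimated frequencies lie closer to the truth across pools.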
