Forneris Natalia S, Legarra Andres, Vitezica Zulma G, Tsuruta Shogo, Aguilar Ignacio, Misztal Ignacy, Cantet Rodolfo J C
Departamento de Producción Animal, Facultad de Agronomía, Universidad de Buenos Aires, C1417DSE Buenos Aires, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Av. Rivadavia 1917, C1033AAJ Buenos Aires, Argentina.
INRA, Génétique, Physiologie et Systèmes d'Elevage (GenPhySE), F-31326 Castanet-Tolosan, France; Université de Toulouse, INP, ENSAT, Génétique, Physiologie et Systèmes d'Elevage (GenPhySE), F-31326 Castanet-Tolosan, France.
Genetics. 2015 Mar;199(3):675-81. doi: 10.1534/genetics.114.173559. Epub 2015 Jan 6.
Quality control filtering of single-nucleotide polymorphisms (SNPs) is a key step when analyzing genomic data. Here we present a practical method to identify low-quality SNPs, meaning markers whose genotypes are wrongly assigned for a large proportion of individuals, by estimating the heritability of gene content at each marker. Gene content is the number of copies of a particular reference allele in the genotype of an animal (0, 1, or 2). If there is no mutation at the marker, gene content has an additive heritability of 1 by construction. The method uses restricted maximum likelihood (REML) to estimate the heritability of gene content at each SNP and builds a likelihood-ratio test statistic to test for zero error variance in genotyping. As a by-product, it yields estimates of the marker allele frequencies in the base population. Using simulated data with 10% permutation error (4% actual error) in genotyping, the method had a specificity of 0.96 (4% of correct markers are rejected) and a sensitivity of 0.99 (1% of wrong markers are accepted) when markers with heritability lower than 0.975 are discarded. Checking for Mendelian errors gave a lower sensitivity (0.84) in the same simulation. The proposed method is further illustrated with a real data set of 3534 animals genotyped for 50,433 markers from the Illumina PorcineSNP60 chip, with a pedigree of 6473 individuals; those markers had undergone very little quality control. A total of 4099 markers with P-values lower than 0.01 were discarded based on our method, with associated heritability estimates as low as 0.12. Unlike other techniques, our method uses all information in the population simultaneously, can be used in any population with marker and pedigree records, and is simple to implement using standard software for REML estimation. Scripts for its use are provided.
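The filtering logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script. It assumes the REML fit has already been run in external software, so that each marker comes with a heritability estimate and two restricted log-likelihoods (with and without an error variance). The function names, the two-letter genotype coding, and the 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom used for the P-value are assumptions made for illustration; the abstract does not state the reference distribution of the test. The default thresholds (0.975 and 0.01) are the ones quoted in the abstract.

```python
import math


def gene_content(genotype: str, ref_allele: str) -> int:
    """Number of copies (0, 1, or 2) of ref_allele in a genotype such as 'AG'."""
    return sum(1 for allele in genotype if allele == ref_allele)


def lrt_p_value(loglik_with_error: float, loglik_no_error: float) -> float:
    """P-value for H0: genotyping error variance is zero.

    Assumption: because the variance is tested on the boundary of its
    parameter space, the statistic is referred to a 50:50 mixture of
    chi-square distributions with 0 and 1 degrees of freedom.
    """
    stat = max(0.0, 2.0 * (loglik_with_error - loglik_no_error))
    if stat == 0.0:
        return 1.0
    # Survival function of chi-square(1) is erfc(sqrt(x / 2)); halve it for the mixture.
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))


def discard_by_heritability(h2: float, threshold: float = 0.975) -> bool:
    """Rule used in the simulation: discard markers with estimated h2 below the threshold."""
    return h2 < threshold


def discard_by_p_value(p_value: float, alpha: float = 0.01) -> bool:
    """Rule used on the real data: discard markers whose P-value is below alpha."""
    return p_value < alpha


def sensitivity_specificity(is_wrong: list, is_discarded: list) -> tuple:
    """Sensitivity = wrong markers discarded / wrong markers;
    specificity = correct markers kept / correct markers."""
    pairs = list(zip(is_wrong, is_discarded))
    wrong = [d for w, d in pairs if w]
    correct = [d for w, d in pairs if not w]
    sensitivity = sum(wrong) / len(wrong)
    specificity = sum(not d for d in correct) / len(correct)
    return sensitivity, specificity
```

With these definitions, a marker whose heritability estimate is 0.12 (the lowest value reported for the real data) is discarded by the heritability rule, while a marker with an estimate of 0.99 is kept.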