Le Dantec Loïck, Chagné David, Pot David, Cantin Olivier, Garnier-Géré Pauline, Bedon Frank, Frigerio Jean-Marc, Chaumeil Philippe, Léger Patrick, Garcia Virginie, Laigret Frédéric, De Daruvar Antoine, Plomion Christophe
Unité de Recherche sur les Espèces Fruitières et la Vigne, INRA, 71 avenue Edouard Bourlaux, BP 81, 33883 Villenave d'Ornon Cedex, France.
Plant Mol Biol. 2004 Feb;54(3):461-70. doi: 10.1023/B:PLAN.0000036376.11710.6f.
We developed an automated pipeline for the detection of single nucleotide polymorphisms (SNPs) in expressed sequence tag (EST) data sets by combining three DNA sequence analysis programs: Phred, Phrap and PolyBayes. This application requires access to the individual electropherogram traces. First, a reference set of 65 SNPs was obtained by sequencing 30 gametes for 13 maritime pine (Pinus pinaster Ait.) gene fragments (6671 bp in total), giving a frequency of 1 SNP every 102.6 bp. Second, the parameters of the three programs were optimized to retrieve as many true SNPs as possible while keeping the false-positive rate as low as possible. Overall, the efficiency of detection of true SNPs was 83.1%. However, this rate varied widely with the frequency of the rare SNP allele: it fell to 41% for rare SNP alleles (frequency < 10%) and rose to 98% for allele frequencies above 10%. Third, the detection method was applied to the 18498 assembled maritime pine ESTs, which identified a total of 1400 candidate SNPs in contigs containing between 4 and 20 sequence reads. These genetic resources, described for the first time in a forest tree species, were made available at http://www.pierroton.inra/genetics/Pinesnps. We also derived an analytical expression for the SNP detection probability as a function of the SNP allele frequency, the number of haploid genomes used to generate the EST sequence database, and the sample size of the contigs considered for SNP detection. The frequency of the SNP allele was shown to be the main factor influencing the probability of SNP detection.
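The abstract does not reproduce the analytical expression itself, so the following is only an illustrative sketch of why allele frequency dominates detection probability, not the authors' formula. It assumes that a SNP can be detected in a contig only if both alleles appear among the sampled reads, and it ignores sequencing error and the pipeline's quality filters. The function names and both sampling models (binomial for an effectively unlimited pool, hypergeometric for a finite number of haploid genomes with one read per genome) are assumptions introduced here.

```python
from math import comb


def prob_both_alleles_binomial(p: float, n: int) -> float:
    """Probability that n independently sampled reads contain both alleles.

    Assumption: reads are drawn independently from an effectively
    unlimited pool in which the rare allele has frequency p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Subtract the two failure cases: all reads common, or all reads rare.
    return 1.0 - (1.0 - p) ** n - p ** n


def prob_both_alleles_hypergeom(genomes: int, rare_copies: int, n: int) -> float:
    """Probability that n reads, each from a distinct haploid genome,
    contain both alleles.

    Assumption: the database derives from `genomes` haploid genomes,
    `rare_copies` of which carry the rare allele, and reads are sampled
    without replacement (hypergeometric model).
    """
    if not 0 <= rare_copies <= genomes:
        raise ValueError("rare_copies must be in [0, genomes]")
    if not 0 <= n <= genomes:
        raise ValueError("n must be in [0, genomes]")
    total = comb(genomes, n)
    # math.comb returns 0 when the sample is larger than the group,
    # so an impossible all-rare or all-common sample contributes nothing.
    all_common = comb(genomes - rare_copies, n) / total
    all_rare = comb(rare_copies, n) / total
    return 1.0 - all_common - all_rare


if __name__ == "__main__":
    # Contig depths of 4 and 20 match the range reported in the abstract;
    # the allele frequencies are illustrative values only.
    for p in (0.05, 0.10, 0.30, 0.50):
        for n in (4, 20):
            print(f"p={p:.2f} n={n:2d} P={prob_both_alleles_binomial(p, n):.3f}")
```

Under these assumptions, the probability of sampling both alleles rises steeply with the rare-allele frequency at a fixed contig depth, which is qualitatively in line with the drop in detection efficiency reported for alleles below 10%.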