Arrigo Nils, Tuszynski Jarek W, Ehrich Dorothee, Gerdes Tommy, Alvarez Nadir
Laboratory of Evolutionary Botany, Institute of Biology, University of Neuchâtel, 11 rue Emile-Argand, CH-2000 Neuchâtel, Switzerland.
BMC Bioinformatics. 2009 Jan 26;10:33. doi: 10.1186/1471-2105-10-33.
Since the transfer and application of modern sequencing technologies to the analysis of amplified fragment-length polymorphisms (AFLP), evolutionary biologists have included an increasing number of samples and markers in their studies. Although justified in this context, the use of automated scoring procedures may result in technical biases that weaken the power and reliability of further analyses.
Using a new scoring algorithm, RawGeno, we show that scoring errors, in particular "bin oversplitting" (i.e. when variant sizes of the same AFLP marker are not considered homologous) and "technical homoplasy" (i.e. when two AFLP markers that differ slightly in size are mistakenly considered homologous), induce a loss of discriminatory power, decrease the robustness of results and, in extreme cases, introduce erroneous information into genetic structure analyses. In the present study, we evaluate several descriptive statistics that can be used to optimize AFLP scoring, and we describe a new statistic, the information content per bin (Ibin), which is a valuable estimator during the optimization process. This statistic can be computed at any stage of the AFLP analysis without requiring the inclusion of replicated samples. Finally, we show that downstream analyses are not equally sensitive to scoring errors. The scoring procedure can be optimized with a reasonable amount of flexibility without causing considerable changes in the detection of genetic structure patterns, but notable discrepancies are observed when genetic diversities are estimated from differently scored datasets.
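The two scoring errors named above can be illustrated with a toy example. The sketch below is not RawGeno's algorithm: it assumes a deliberately naive fixed-width binning (a peak of size s falls in bin floor(s / bin_width)), and the function name `score_peaks`, the sample names and the peak sizes are all hypothetical. Two true markers are simulated, one near 100 bp and one near 101 bp, each with small sizing variation between samples.

```python
def score_peaks(samples, bin_width):
    """Convert per-sample peak sizes (bp) into a presence/absence matrix.

    Naive fixed-width binning, used only for illustration (an assumption,
    not the RawGeno method): a peak of size s goes to bin floor(s / bin_width).
    """
    bins = sorted({int(s // bin_width) for peaks in samples.values() for s in peaks})
    matrix = {}
    for name, peaks in samples.items():
        present = {int(s // bin_width) for s in peaks}
        matrix[name] = [1 if b in present else 0 for b in bins]
    return bins, matrix


# Hypothetical data: marker A is sized near 100 bp, marker B near 101 bp.
SAMPLES = {
    "s1": [100.2, 101.3],  # carries A and B
    "s2": [100.6],         # carries A only
    "s3": [100.4],         # carries A only
    "s4": [101.5],         # carries B only
}

for width in (0.25, 1.0, 2.0):
    bins, matrix = score_peaks(SAMPLES, width)
    print(f"bin width {width} bp: {len(bins)} bins")
    for name, row in matrix.items():
        print(f"  {name}: {row}")
```

With the narrowest bins, the size variants of marker A are spread over several bins, so samples sharing the marker no longer share a scored band (bin oversplitting). With the widest bins, markers A and B collapse into a single bin, so s2 and s4 receive identical profiles although they carry different fragments (technical homoplasy). Only the intermediate width recovers the two simulated markers in this toy case.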
Our algorithm appears to perform as well as a commercial program in automating AFLP scoring, at least in the context of population genetics or phylogeographic studies. To our knowledge, RawGeno is the only freely available public-domain software for fully automated AFLP scoring, from electropherogram files to user-defined working binary matrices. RawGeno was implemented as an R CRAN package (with a user-friendly GUI) and can be found at http://sourceforge.net/projects/rawgeno.