Science & Education, Integrative Research Center, The Field Museum, 1400 South Lake Shore Drive, Chicago, IL, 60605-2496, USA.
J Mol Evol. 2014 Feb;78(2):148-62. doi: 10.1007/s00239-013-9603-y. Epub 2013 Dec 17.
The internal transcribed spacer region (ITS) of the nuclear rDNA cistron represents the barcoding locus for Fungi. Intragenomic variation of this multicopy gene can interfere with accurate phylogenetic reconstruction of biological entities. We investigated the amount and nature of this variation for the lichenized fungus Cora inversa in the Hygrophoraceae (Basidiomycota: Agaricales), analyzing base call and length variation in ITS1 454 pyrosequencing data of three samples of the target mycobiont, for a total of 16,665 reads obtained from three separate repeats of the same samples under different conditions. Using multiple fixed alignment methods (PaPaRa) and maximum likelihood phylogenetic analysis (RAxML), we assessed phylogenetic relationships of the obtained reads, together with Sanger ITS sequences from the same samples. Phylogenetic analysis showed that all ITS1 reads belonged to a single species, C. inversa. Pyrosequencing data showed 266 insertion sites in addition to the 325 sites expected from Sanger sequences, for a total of 15,654 insertions (0.94 insertions per read). An additional 3,279 substitutions relative to the Sanger sequences were detected in the dataset, out of 5,461,125 bases to be called. Up to 99.3% of the observed indels in the dataset could be interpreted as 454 pyrosequencing errors, approximately 65% corresponding to incorrectly recovered homopolymer segments, and 35% to carry-forward-incomplete-extension errors. Comparison of automated clustering and alignment-based phylogenetic analysis demonstrated that clustering of these reads produced a 35-fold overestimation of biological diversity in the dataset at the 95% similarity threshold level, whereas phylogenetic analysis using a maximum likelihood approach accurately recovered a single biological entity. 
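The per-read and per-base rates implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the totals quoted in the abstract; the variable names are illustrative and are not taken from the study.

```python
# Totals as reported in the abstract (variable names are illustrative).
reads = 16_665            # ITS1 pyrosequencing reads
insertions = 15_654       # insertions relative to the Sanger sequences
substitutions = 3_279     # substitutions relative to the Sanger sequences
bases_called = 5_461_125  # bases to be called, as stated in the abstract

# Average number of insertions carried by a single read.
insertions_per_read = insertions / reads

# Fraction of called bases that differ from the Sanger references.
substitution_rate = substitutions / bases_called

print(f"insertions per read: {insertions_per_read:.2f}")
print(f"substitutions per called base: {substitution_rate:.4%}")
```

On these figures, insertions are far more frequent than substitutions, which is consistent with most of the observed variation being indel-type pyrosequencing error.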
We conclude that variation detected in 454 pyrosequencing data must be interpreted with great care and that a combination of a sufficiently large number of reads per taxon, a set of Sanger references for the same taxon, and at least two runs under different emulsion PCR and sequencing conditions is necessary to reliably separate biological variation from 454 sequencing errors. Our study shows that clustering methods are highly sensitive to artifactual sequence variation and are inadequate to properly recover biological diversity in a dataset if sequencing errors are substantial and not removed prior to clustering analysis.
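A toy illustration may help show why homopolymer-length errors inflate the number of distinct sequences that a clustering step has to deal with. The sketch below is not the method used in the study, which relied on alignment against Sanger references and maximum likelihood analysis. It simply collapses every run of identical bases to one base, on hypothetical reads invented for this example.

```python
from itertools import groupby


def collapse_homopolymers(seq: str) -> str:
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Hypothetical toy reads (not data from the study). The second and third
# differ from the first only in homopolymer length; the fourth carries a
# substitution (T -> C) and so remains distinct after collapsing.
toy_reads = [
    "ACGGGTTAACC",
    "ACGGTTAACC",
    "ACGGGGTTAAACC",
    "ACGGGTCAACC",
]

raw_variants = set(toy_reads)
collapsed_variants = {collapse_homopolymers(r) for r in toy_reads}

print(len(raw_variants), "distinct raw reads")
print(len(collapsed_variants), "distinct reads after homopolymer collapse")
```

Collapsing also discards any genuine homopolymer length differences, so it serves here only as a demonstration of the error mode and is no substitute for reference-based error assessment.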