Institute of Arctic Biology, University of Alaska Fairbanks, Fairbanks, AK 99775, USA.
Mol Phylogenet Evol. 2013 Apr;67(1):234-45. doi: 10.1016/j.ympev.2013.01.018. Epub 2013 Feb 9.
The number of sequences from both formally described taxa and uncultured environmental DNA deposited in the International Nucleotide Sequence Databases has increased substantially over the last two decades. Although the majority of these sequences represent authentic gene copies, there is evidence of DNA artifacts in these databases as well. These include lab artifacts, such as PCR chimeras, and biological artifacts such as pseudogenes or other paralogous sequences. Sequences that fall in basal positions in phylogenetic trees and appear distant from known sequences are particularly suspect. Phylogenetic analyses suggest that a novel sequence type (NS1) found in two boreal forest soil clone libraries belongs to the fungal kingdom but does not fall unambiguously within any known phylum. We have evaluated this sequence type using an array of secondary-structure analyses. To our knowledge, such analyses have never been used on environmental ribosomal sequences. Ribosomal secondary structure was modeled for four rRNA loci (ITS1, 5.8S, ITS2, 5' LSU). These models were analyzed for the presence of conserved domains, conserved nucleotide motifs, and compensatory base changes. Minimal free energy (MFE) foldings and GC contents of sequences representing the major fungal clades, as well as NS1, were also compared. NS1 displays secondary rRNA structures consistent with other fungi and many, but not all, conserved nucleotide motifs found across eukaryotes. However, our analyses show that many other authentic sequences from basal fungi lack more of these conserved motifs than does NS1. Together our findings suggest that NS1 represents an authentic gene copy. The methods described here can be used on any rRNA-coding sequence, not just environmental fungal sequences. As new-generation sequencing methods that yield shorter sequences become more widely implemented, methods that evaluate sequence authenticity should also be more widely implemented. 
For fungi, the adjacent 5.8S and ITS2 loci should be prioritized. This region is not only suited to distinguishing between closely related species, but it is also more informative in terms of expected secondary structure.
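Two of the comparisons named in the abstract, GC content and compensatory base changes (CBCs), are simple enough to illustrate in a few lines. The sketch below is not the authors' pipeline. It assumes two pre-aligned, gap-free sequences of equal length and a shared secondary structure in dot-bracket notation without pseudoknots. The function names (`gc_content`, `paired_positions`, `count_cbcs`) are hypothetical, and G-U wobble pairs are assumed to count as valid pairings. MFE folding itself requires a dedicated folding program and is not shown.

```python
# Illustrative sketch only: the names and the treatment of G-U wobble
# pairs as canonical are assumptions, not details taken from the paper.

CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T, U)."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGTU")
    return gc / total if total else 0.0


def paired_positions(structure: str) -> list[tuple[int, int]]:
    """Return (i, j) index pairs from a dot-bracket string (no pseudoknots)."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def count_cbcs(seq_a: str, seq_b: str, structure: str) -> tuple[int, int]:
    """Count (full CBCs, hemi-CBCs) between two aligned sequences.

    A full CBC changes both bases of a pair while pairing is kept;
    a hemi-CBC changes only one base while pairing is kept.
    """
    a = seq_a.upper().replace("T", "U")
    b = seq_b.upper().replace("T", "U")
    full = hemi = 0
    for i, j in paired_positions(structure):
        # Skip positions where either sequence fails to form a valid pair.
        if (a[i], a[j]) not in CANONICAL or (b[i], b[j]) not in CANONICAL:
            continue
        changed = (a[i] != b[i]) + (a[j] != b[j])
        if changed == 2:
            full += 1
        elif changed == 1:
            hemi += 1
    return full, hemi


if __name__ == "__main__":
    helix = "((..))"
    print(gc_content("GCAAGC"))
    print(count_cbcs("GCAAGC", "GUAAAC", helix))
```

In this toy helix the inner C-G pair of the first sequence becomes U-A in the second, so both bases change while pairing is preserved, which is what the sketch counts as a full CBC.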