Department of Life Sciences, University of Trieste, via Giorgieri 5, 34127, Trieste, Italy.
Division of Oceanography, National Institute of Oceanography and Applied Geophysics, via Piccard 54, 34151, Trieste, Italy.
Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baz155.
DNA metabarcoding combines DNA barcoding with high-throughput sequencing to identify different taxa within environmental communities. The internal transcribed spacer (ITS) has already been proposed and widely used as a universal barcode marker for plants, but a comprehensive, updated and accurate reference dataset of plant ITS sequences has not been available so far. Here, we constructed reference datasets of Viridiplantae ITS1, ITS2 and entire ITS sequences, including both Chlorophyta and Streptophyta. The sequences were retrieved from NCBI, and the ITS region was extracted. The sequences underwent an identity check to remove misidentified records and were clustered at 99% identity to reduce redundancy and computational effort. For this step, we developed a script called 'better clustering for QIIME' (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level. The three datasets obtained with the bc4q script are PLANiTS1 (100 224 sequences), PLANiTS2 (96 771 sequences) and PLANiTS (97 550 sequences); all are pre-formatted for QIIME, the most widely used bioinformatic pipeline for metabarcoding analysis. As curated and updated reference databases, PLANiTS1, PLANiTS2 and PLANiTS are proposed as a reliable, pivotal first step toward a general standardization of plant DNA metabarcoding studies. The bc4q script is presented as a new tool useful for any study that involves sequence clustering. Database URL: https://github.com/apallavicini/bc4q; https://github.com/apallavicini/PLANiTS.
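The idea of choosing a cluster representative according to the cluster's taxonomic composition can be illustrated with a minimal Python sketch. This is not the bc4q implementation, whose exact selection rule is not described here. It assumes one plausible rule: the representative is the first member that carries the most common taxon at a chosen rank. The function name `pick_representative`, the sequence IDs and the three-level lineages are hypothetical and serve only as an illustration.

```python
from collections import Counter

def pick_representative(cluster, taxonomy, rank):
    """Return a member whose taxon at `rank` is the most common in the cluster.

    Assumed rule for illustration only; not taken from bc4q.
    cluster  -- list of sequence IDs grouped at, e.g., 99% identity
    taxonomy -- dict mapping sequence ID to a lineage list
    rank     -- index into the lineage list (0 = highest rank)
    """
    # Taxon of each cluster member at the chosen rank.
    taxa = {seq_id: taxonomy[seq_id][rank] for seq_id in cluster}
    # Most frequent taxon; on ties, the one encountered first wins.
    majority_taxon, _ = Counter(taxa.values()).most_common(1)[0]
    # First member, in cluster order, that carries the majority taxon.
    return next(s for s in cluster if taxa[s] == majority_taxon)

# Hypothetical IDs and lineages (phylum, family, genus).
taxonomy = {
    "seqA": ["Streptophyta", "Rosaceae", "Rosa"],
    "seqB": ["Streptophyta", "Rosaceae", "Rubus"],
    "seqC": ["Streptophyta", "Rosaceae", "Rubus"],
}
cluster = ["seqA", "seqB", "seqC"]

genus_rep = pick_representative(cluster, taxonomy, rank=2)
family_rep = pick_representative(cluster, taxonomy, rank=1)
print(genus_rep, family_rep)
```

In this toy cluster the chosen representative depends on the rank: at genus level the majority taxon is Rubus, whereas at family level all members agree and the first member is returned.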