Curd Emily E, Gal Luna, Gallego Ramon, Nielsen Shaun, Gold Zachary
Vermont Biomedical Research Network, University of Vermont, VT, USA.
Landmark College, VT, USA.
bioRxiv. 2023 Jun 3:2023.05.31.543005. doi: 10.1101/2023.05.31.543005.
Curated, comprehensive reference barcode databases are key to making accurate taxonomic assignments. However, generating and curating such databases remains challenging given the large and continuously growing volumes of DNA sequence data and novel reference barcode targets. Monitoring and research applications require a greater diversity of specialized gene regions and targeted taxa to meet taxonomic classification goals than are currently curated by professional staff. Thus, there is a growing need for an easy-to-implement tool that can generate comprehensive metabarcoding reference libraries for any bespoke locus. We address this need by reimagining CRUX from the Anacapa Toolkit and present the rCRUX package in R. The typical workflow involves searching for plausible seed amplicons (() or ()) by simulating PCR to acquire seed sequences containing a user-defined primer set. Next, these seeds are used in iterative BLAST searches against a local NCBI-formatted database using a taxonomic-rank-based stratified random sampling approach (()), which results in a comprehensive set of sequence matches. This database is dereplicated and cleaned (()) by identifying identical reference sequences and collapsing the taxonomic path to the lowest taxonomic agreement across all matching reads. The result is a curated, comprehensive database of primer-specific reference barcode sequences from NCBI. We demonstrate that rCRUX provides more comprehensive reference databases for the MiFish Universal Teleost 12S, Taberlet trnL, and fungal ITS loci than the CRABS, METACURATOR, RESCRIPt, and ECOPCR reference databases. We then further demonstrate the utility of rCRUX by generating 16 reference databases for metabarcoding loci that lack dedicated reference database curation efforts.
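The rank-based stratified random sampling mentioned above can be illustrated with a minimal sketch. This is not the rCRUX implementation (rCRUX is an R package). The function name `stratified_sample`, the record layout, and the placeholder taxon names are all assumptions made for illustration. The idea shown is only that seeds are grouped by their value at a chosen taxonomic rank and a bounded random subset is drawn from each group, so that abundant taxa do not crowd out rare ones in the next search round.

```python
import random
from collections import defaultdict


def stratified_sample(records, rank, max_per_group, seed=0):
    """Draw at most `max_per_group` records from each group sharing a value at `rank`.

    Hypothetical helper for illustration; `records` is a list of dicts that
    each carry a key named by `rank` (for example "genus").
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[rank]].append(rec)

    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        k = min(max_per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy seed table with placeholder identifiers (not real accessions or taxa).
seeds = [
    {"id": "seq1", "genus": "GenusA"},
    {"id": "seq2", "genus": "GenusA"},
    {"id": "seq3", "genus": "GenusA"},
    {"id": "seq4", "genus": "GenusB"},
]

picked = stratified_sample(seeds, rank="genus", max_per_group=1)
```

With `max_per_group=1`, the sample holds one seed per genus even though `GenusA` contributes three of the four records. Raising the cap or choosing a coarser rank trades search cost against coverage.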
The rCRUX package provides a simple-to-use tool for generating curated, comprehensive reference databases for user-defined loci, facilitating accurate and effective taxonomic classification in metabarcoding and DNA sequencing efforts broadly.
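The dereplication step described in the abstract (identifying identical reference sequences and collapsing the taxonomic path to the lowest taxonomic agreement) can be sketched as follows. This is a hedged illustration in Python, not the package's own R code. The function names, the three-rank path, the "NA" marker for unresolved ranks, and the placeholder taxa are assumptions introduced here.

```python
from collections import defaultdict

# Assumed rank order for the toy example; real taxonomic paths are longer.
RANKS = ("family", "genus", "species")


def collapse_paths(paths):
    """Keep each rank while all paths agree; mark it and all lower ranks "NA" after the first conflict."""
    consensus = []
    agree = True
    for values in zip(*paths):  # iterate rank by rank across all paths
        if agree and len(set(values)) == 1:
            consensus.append(values[0])
        else:
            agree = False
            consensus.append("NA")
    return tuple(consensus)


def dereplicate(records):
    """Group (sequence, path) pairs by identical sequence and collapse each group."""
    by_seq = defaultdict(list)
    for seq, path in records:
        by_seq[seq].append(path)
    return {seq: collapse_paths(paths) for seq, paths in by_seq.items()}


# Toy records with placeholder taxa: two species share one identical barcode.
records = [
    ("ACGTACGT", ("FamilyA", "GenusA", "GenusA sp1")),
    ("ACGTACGT", ("FamilyA", "GenusA", "GenusA sp2")),
    ("TTGGCCAA", ("FamilyB", "GenusB", "GenusB sp1")),
]

database = dereplicate(records)
```

In this toy input, the shared barcode `ACGTACGT` resolves only to genus, so its species rank becomes "NA", while the unique barcode keeps its full path. Collapsing this way avoids reporting a species-level assignment that the sequence cannot support.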