Department of Integrative Biology, University of Guelph. Guelph, Ontario, Canada.
Centre for Biodiversity Genomics, Biodiversity Institute of Ontario, University of Guelph. Guelph, Ontario, Canada.
Genome. 2020 Jun;63(6):291-305. doi: 10.1139/gen-2019-0206. Epub 2020 May 14.
Biological conclusions based on DNA barcoding and metabarcoding analyses can be strongly influenced by the methods utilized for data generation and curation, leading to varying levels of success in the separation of biological variation from experimental error. The 5' region of cytochrome oxidase subunit I (COI-5P) is the most common barcode gene for animals, with conserved structure and function that allows for biologically informed error identification. Here, we present coil ( https://CRAN.R-project.org/package=coil ), an R package for the pre-processing and frameshift error assessment of COI-5P animal barcode and metabarcode sequence data. The package contains functions for placement of barcodes into a common reading frame, accurate translation of sequences to amino acids, and highlighting insertion and deletion errors. The analysis of 10 000 barcode sequences of varying quality demonstrated how coil can place barcode sequences in reading frame and distinguish sequences containing indel errors from error-free sequences with greater than 97.5% accuracy. Package limitations were tested through the analysis of COI-5P sequences from the plant and fungal kingdoms as well as the analysis of potential contaminants: nuclear mitochondrial pseudogenes and COI-5P sequences. Results demonstrated that coil is a strong technical error identification method but is not reliable for detecting all biological contaminants.
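The idea of placing a barcode in reading frame, translating it, and flagging likely indels can be illustrated with a minimal Python sketch. It is not coil's algorithm or API; the abstract does not describe the implementation. The sketch assumes a much cruder heuristic: it tries the three forward reading frames, keeps the one with the fewest stop codons, and treats any remaining stop codon as a sign of a possible frameshift. The function names (`translate_frame`, `pick_frame`, `looks_frameshifted`), the choice of NCBI translation table 5 (invertebrate mitochondrial), and the toy sequence are all assumptions made for illustration.

```python
from itertools import product

# NCBI translation table 5 (invertebrate mitochondrial), codons in TCAG order.
# Assumption: a single fixed table is used; real data may need another table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, offset=0):
    """Translate seq from the given offset; unknown codons become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def pick_frame(seq):
    """Return (offset, protein) for the offset in 0..2 with the fewest stops.

    Ties are broken in favour of the smallest offset.
    """
    stops, offset = min(
        (translate_frame(seq, o).count("*"), o) for o in range(3)
    )
    return offset, translate_frame(seq, offset)


def looks_frameshifted(seq):
    """Heuristic: flag the sequence if even its best frame has a stop codon."""
    _, protein = pick_frame(seq)
    return "*" in protein


# Toy sequence (hypothetical, not real COI-5P): open in frame 0 only.
clean = "CTAGGTAAC" * 10
shifted = "G" + clean                            # extra leading base, no indel
with_insertion = clean[:45] + "A" + clean[45:]   # one inserted base mid-read

print(pick_frame(shifted)[0])
print(looks_frameshifted(clean), looks_frameshifted(with_insertion))
```

In this toy case the extra leading base is absorbed by choosing a different frame, while the mid-sequence insertion leaves stop codons in every frame and is flagged. A stop-codon count will miss indels that do not introduce stops, so it is only a rough stand-in for the biologically informed error identification the abstract describes.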