Richardson Rodney T, Bengtsson-Palme Johan, Gardiner Mary M, Johnson Reed M
Department of Entomology, Ohio State University, Columbus, OH, United States of America.
Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.
PeerJ. 2018 Jun 26;6:e5126. doi: 10.7717/peerj.5126. eCollection 2018.
Metabarcoding is a popular application that warrants continued methods optimization. To maximize barcoding inferences, hierarchy-based sequence classification methods are increasingly common. We present methods for the construction and curation of a database designed for hierarchical classification of a 157 bp barcoding region of the arthropod cytochrome c oxidase subunit I (COI) locus. We produced a comprehensive arthropod COI amplicon dataset including annotated arthropod COI sequences and COI sequences extracted from arthropod whole mitochondrion genomes, the latter of which provided the only source of representation for Zoraptera, Callipodida and Holothyrida. The database contains extracted sequences of the target amplicon from all major arthropod clades, including all insect orders and all arthropod classes, as well as Onychophora, Tardigrada and Mollusca outgroups. During curation, we extracted the COI region of interest from approximately 81 percent of the input sequences, corresponding to 73 percent of the genus-level diversity found in the input data. Further, our analysis revealed a high degree of sequence redundancy within the NCBI nucleotide database, with a mean of approximately 11 sequence entries per species in the input data. The curated, low-redundancy database is included in the Metaxa2 sequence classification software (http://microbiology.se/software/metaxa2/). Using this database with the Metaxa2 classifier, we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold that minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database. Our work will help researchers design and evaluate classification databases and conduct metabarcoding on arthropods and other taxa.
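The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not call Metaxa2; it assumes each test sequence from a cross-validation run has already been reduced to a record holding its reliability score, its true taxon, the taxon assigned by the classifier, and whether the true taxon is represented in the reference set. The `Call` record, the `evaluate` function, the example scores and the exact metric definitions (sensitivity as correct accepted calls over all represented test sequences, false discovery rate as incorrect accepted calls over all accepted calls for represented taxa, overclassification as accepted calls over all unrepresented test sequences) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Call:
    """One test sequence from a cross-validation run (hypothetical record)."""
    score: float              # reliability score attached to the assignment
    true_taxon: str           # known taxon of the test sequence
    predicted: Optional[str]  # assigned taxon, or None if nothing was assigned
    in_reference: bool        # True if the true taxon is in the reference set


def evaluate(calls: List[Call], threshold: float) -> Dict[str, float]:
    """Compute assumed definitions of the three metrics at a score threshold."""
    tp = fp = known = novel = over = 0
    for c in calls:
        # An assignment counts only if it exists and meets the threshold.
        accepted = c.predicted is not None and c.score >= threshold
        if c.in_reference:
            known += 1
            if accepted and c.predicted == c.true_taxon:
                tp += 1
            elif accepted:
                fp += 1
        else:
            novel += 1
            if accepted:
                # Any accepted call for an unrepresented taxon is an
                # overclassification, since no correct answer is available.
                over += 1
    return {
        "sensitivity": tp / known if known else 0.0,
        "fdr": fp / (tp + fp) if (tp + fp) else 0.0,
        "overclassification": over / novel if novel else 0.0,
    }


# Illustrative data only; the taxa and scores are made up.
example_calls = [
    Call(90.0, "Apis", "Apis", True),
    Call(85.0, "Bombus", "Apis", True),
    Call(40.0, "Musca", "Musca", True),
    Call(95.0, "Drosophila", "Drosophila", True),
    Call(75.0, "Zorotypus", "Apis", False),
    Call(30.0, "Peripatus", "Apis", False),
]
result = evaluate(example_calls, threshold=70.0)
print(result)
```

Calling `evaluate` over a range of thresholds and comparing the resulting error rates is one simple way to choose a cutoff, in the spirit of the threshold selection described above.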