Anderson Michael P, Dubnicka Suzanne R
Stat Appl Genet Mol Biol. 2014 Aug;13(4):423-34. doi: 10.1515/sagmb-2013-0025.
DNA barcodes are short strands of 255-700 nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) region of the mitochondrial DNA. These barcodes have been proposed as a means of differentiating between biological species. Current species classification methods rely on distance measures that depend heavily on both evolutionary model assumptions and a clearly defined "gap" between intra- and interspecies variation. Such distance measures neither quantify classification uncertainty nor indicate how much of the barcode is necessary for classification. We propose a sequential naïve Bayes classifier for species classification to address these limitations. The proposed method is shown to provide accurate species-level classification on real and simulated data. It also quantifies the uncertainty of each classification and indicates how much of the barcode is necessary.
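The abstract does not spell out the classifier, so the following is only a minimal sketch of how a sequential naïve Bayes classification of aligned barcodes might look, not the authors' implementation. It assumes that positions are independent given the species, that base probabilities at each position are estimated with a Laplace pseudocount, that the prior over species is uniform, and that reading stops once one species' posterior passes a chosen threshold. The names `train`, `classify_sequentially`, `pseudocount`, and `threshold` are hypothetical, and the toy sequences are made up for illustration rather than taken from real COI data.

```python
import math
from collections import Counter

BASES = "ACGT"


def train(sequences_by_species, pseudocount=1.0):
    """Estimate log P(base | species, position) from aligned, equal-length sequences.

    Laplace smoothing (an assumption of this sketch) keeps unseen bases
    from receiving zero probability.
    """
    model = {}
    for species, seqs in sequences_by_species.items():
        length = len(seqs[0])
        total = len(seqs) + pseudocount * len(BASES)
        table = []
        for pos in range(length):
            counts = Counter(s[pos] for s in seqs)
            table.append(
                {b: math.log((counts[b] + pseudocount) / total) for b in BASES}
            )
        model[species] = table
    return model


def classify_sequentially(model, barcode, threshold=0.95):
    """Read the barcode one base at a time, updating the species posterior.

    Stops as soon as the leading species reaches `threshold`.
    Returns (species, posterior, positions_read).
    """
    species = list(model)
    # Uniform prior over species (an assumption of this sketch).
    log_post = {s: -math.log(len(species)) for s in species}
    n_positions = min(len(barcode), min(len(t) for t in model.values()))
    best, best_p, used = species[0], 1.0 / len(species), 0

    for pos in range(n_positions):
        base = barcode[pos]
        used = pos + 1
        if base in BASES:  # ambiguous symbols (e.g. N or gaps) carry no evidence
            for s in species:
                log_post[s] += model[s][pos][base]
        # Normalise in log space to avoid underflow on long barcodes.
        peak = max(log_post.values())
        norm = peak + math.log(sum(math.exp(v - peak) for v in log_post.values()))
        posterior = {s: math.exp(v - norm) for s, v in log_post.items()}
        best = max(posterior, key=posterior.get)
        best_p = posterior[best]
        if best_p >= threshold:
            break
    return best, best_p, used


# Made-up toy alignment, for illustration only.
toy_training = {
    "species_a": ["ACGTAC", "ACGTAC", "ACGTAT"],
    "species_b": ["TTGTAC", "TTGTAC", "TTGTAC"],
}
toy_model = train(toy_training)
label, posterior, positions_read = classify_sequentially(
    toy_model, "ACGTAC", threshold=0.9
)
print(label, round(posterior, 3), positions_read)
```

In this sketch the returned posterior plays the role of the classification uncertainty, and `positions_read` gives a rough sense of how much of the barcode was needed before the threshold was met. Both depend on the smoothing, prior, and threshold chosen here.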