Yang Cheng-Hong, Wu Kuo-Chuan, Chuang Li-Yeh, Chang Hsueh-Wei
Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan.
Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan.
Evol Bioinform Online. 2018 Mar 5;14:1176934318760856. doi: 10.1177/1176934318760856. eCollection 2018.
DNA barcode sequences are accumulating in large data sets. A barcode is generally a sequence longer than 1000 base pairs, which creates a computational burden. Although DNA barcodes were originally envisioned as straightforward species tags, their use for identification is rarely emphasized at present. Single-nucleotide polymorphism (SNP) association studies suggest that SNPs may be ideal targets for feature selection to discriminate between species. We hypothesize that SNP-based barcodes may be more effective than full-length DNA barcode sequences for species discrimination. To test this hypothesis, we evaluated a ribulose diphosphate carboxylase (rbcL) SNP barcoding (RSB) strategy using a decision tree algorithm. After alignment and trimming, 31 SNPs were discovered in the rbcL sequences from 38 Brassicaceae plant species. During decision tree construction, these SNPs were used to set up decision rules that assign the sequences into 2 groups level by level. After algorithm processing, 37 nodes and 31 loci were required to discriminate the 38 species. Finally, sequence tags consisting of 31 SNP barcodes were identified for discriminating the 38 Brassicaceae species, based on the SNP pattern selected by the decision tree using the RSB method. Taken together, this study provides the rationale that the SNP aspect of the DNA barcode for the rbcL gene is a useful and effective sequence for tagging 38 Brassicaceae species.
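The level-by-level splitting described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the four toy sequences, the species names, the function names, and the split criterion (pick the locus and allele that divide the current group most evenly) are all assumptions made for illustration. The sketch finds variable columns in an alignment, builds a binary tree that separates species on one SNP locus per node, and reads off a short SNP tag per species.

```python
from collections import Counter

def find_snp_columns(seqs):
    """Return indices of alignment columns where more than one base occurs."""
    length = len(next(iter(seqs.values())))
    return [i for i in range(length)
            if len({s[i] for s in seqs.values()}) > 1]

def build_tree(seqs, loci):
    """Recursively split species into two groups using one SNP locus per node.

    Assumed criterion: choose the (locus, allele) pair giving the most
    balanced split. The paper's actual criterion may differ.
    """
    names = sorted(seqs)
    if len(names) == 1:
        return names[0]  # leaf: a single species
    best = None
    for locus in loci:
        counts = Counter(seqs[n][locus] for n in names)
        if len(counts) < 2:
            continue  # this locus does not vary within the current group
        for allele in sorted(counts):
            imbalance = abs(2 * counts[allele] - len(names))
            if best is None or imbalance < best[0]:
                best = (imbalance, locus, allele)
    if best is None:
        raise ValueError("species cannot be separated: " + ", ".join(names))
    _, locus, allele = best
    yes = {n: seqs[n] for n in names if seqs[n][locus] == allele}
    no = {n: seqs[n] for n in names if seqs[n][locus] != allele}
    return {"locus": locus, "allele": allele,
            "yes": build_tree(yes, loci), "no": build_tree(no, loci)}

def internal_nodes(tree):
    """Count decision nodes (leaves are plain strings)."""
    if isinstance(tree, str):
        return 0
    return 1 + internal_nodes(tree["yes"]) + internal_nodes(tree["no"])

def used_loci(tree):
    """Collect the set of SNP loci that appear in any decision node."""
    if isinstance(tree, str):
        return set()
    return {tree["locus"]} | used_loci(tree["yes"]) | used_loci(tree["no"])

def classify(tree, seq):
    """Follow the decision rules from the root down to a species leaf."""
    while not isinstance(tree, str):
        tree = tree["yes"] if seq[tree["locus"]] == tree["allele"] else tree["no"]
    return tree

# Hypothetical toy alignment (not data from the study).
toy = {
    "sp_A": "ACGTACGT",
    "sp_B": "ACGTACGA",
    "sp_C": "ACCTACGT",
    "sp_D": "ACCTATGT",
}

snps = find_snp_columns(toy)
tree = build_tree(toy, snps)
loci = sorted(used_loci(tree))
# SNP tag per species: the bases at the loci selected by the tree.
barcodes = {name: "".join(seq[i] for i in loci) for name, seq in toy.items()}
```

A binary tree that isolates every species in its own leaf always has one fewer decision node than leaves, which is consistent with the 37 nodes reported for 38 species. In this toy case the tree has 3 decision nodes for 4 species, and the resulting 3-base tags are enough to tell the 8-base sequences apart.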