Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo, 113-8657, Japan.
J Mol Evol. 2010 Oct;71(4):250-67. doi: 10.1007/s00239-010-9380-9. Epub 2010 Aug 26.
Species identification is one of the most important issues in biological studies. With the recent growth in available genomic information and advances in DNA sequencing technologies, the use of DNA sequences to identify species (commonly referred to as "DNA barcoding") is being tested in many areas. Several methods have been proposed for identifying species from DNA sequences, including similarity scores, analysis of phylogenetic and population genetic information, and detection of species-specific sequence patterns. Although these methods perform well under a range of circumstances, they also have limitations: they are subject to loss of information, require intensive computation, are sensitive to model misspecification, and yield identifications whose significance can be difficult to evaluate. Here, we propose a new DNA barcoding method based on support vector machine (SVM) procedures. The method is nonparametric and is therefore expected to be robust across a wide range of evolutionary scenarios and in multilocus analyses. Furthermore, we describe bootstrap procedures that can be used to test the significance of species identifications. We implemented a novel conversion technique that transforms sequence data into real-valued vectors, which allows bootstrap procedures to be combined easily with our SVM approach. In this study, we present the results of simulation studies and empirical data analyses to demonstrate the performance of our method and discuss its properties.
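The general idea of combining a sequence-to-vector conversion, an SVM, and a bootstrap can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the conversion technique, so a plain one-hot encoding of aligned sites is assumed here; the bootstrap is assumed to resample alignment columns, by analogy with the phylogenetic bootstrap; and the linear SVM is fitted with a Pegasos-style subgradient solver chosen only to keep the example dependency-free. All function names (`encode`, `train_svm`, `decision`, `bootstrap_support`) and the toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def encode(seq, sites):
    """One-hot encode the chosen alignment columns (4 values per site).
    Assumption: this stands in for the paper's unspecified conversion."""
    return [1.0 if seq[s] == b else 0.0 for s in sites for b in BASES]

def train_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM (hinge loss, L2 penalty) fitted by Pegasos-style
    subgradient steps; y holds +1/-1 labels. Inputs are centred on the
    training mean, which stands in for an explicit bias term."""
    rng = random.Random(seed)
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    Z = [[xj - mj for xj, mj in zip(x, mean)] for x in X]
    w = [0.0] * len(mean)
    order = list(range(n))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * zj for wj, zj in zip(w, Z[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * zj for wj, zj in zip(w, Z[i])]
    return w, mean

def decision(model, x):
    """Signed score; positive means the target species."""
    w, mean = model
    return sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mean))

def bootstrap_support(train, labels, query, target, n_boot=100, seed=1):
    """Fraction of site-bootstrap replicates assigning `query` to `target`
    (one-vs-rest; assumes aligned sequences of equal length)."""
    rng = random.Random(seed)
    length = len(query)
    y = [1 if lab == target else -1 for lab in labels]
    hits = 0
    for _ in range(n_boot):
        # Resample alignment columns with replacement.
        sites = [rng.randrange(length) for _ in range(length)]
        X = [encode(s, sites) for s in train]
        model = train_svm(X, y, seed=rng.randrange(2**32))
        if decision(model, encode(query, sites)) > 0:
            hits += 1
    return hits / n_boot

if __name__ == "__main__":
    # Toy aligned sequences for two hypothetical species.
    train = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC",
             "ACCTTCGGAC", "ACCTTCGGAA", "ACCTTCGGTC"]
    labels = ["sp1", "sp1", "sp1", "sp2", "sp2", "sp2"]
    query = "ACGTACGTAC"
    print("support for sp1:", bootstrap_support(train, labels, query, "sp1"))
```

The returned fraction plays the role of a bootstrap support value for one identification. Because the sequences become fixed-length real-valued vectors, each replicate only needs to re-index columns before retraining, which is what makes the bootstrap easy to attach to the classifier. A kernel SVM or a multiclass scheme could be substituted without changing the resampling loop.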