Fiannaca Antonino, La Rosa Massimo, Rizzo Riccardo, Urso Alfonso
Institute of High-Performance Computing and Networking, National Research Council of Italy, Viale delle Scienze, Ed. 11, 90128 Palermo, Italy.
Artif Intell Med. 2015 Jul;64(3):173-84. doi: 10.1016/j.artmed.2015.06.002. Epub 2015 Jul 4.
In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed.
In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, RIPPER, naïve Bayes, RIDOR, and classification tree. The experiments also considered short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the online resource "Barcode of Life Database".
The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification task is performed with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method reaches an accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best of the other classifiers (random forest) reaches an accuracy of 20.9%.
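The short-fragment setting described above can be sketched as follows. This is only one plausible way to draw a fixed-length fragment at a uniformly random position; the abstract states that fragments were randomly extracted but not how, so the function name `random_fragment`, the uniform start position, and the handling of sequences shorter than the requested length are all assumptions.

```python
import random


def random_fragment(sequence, length, rng=random):
    """Return a contiguous fragment of `length` bases from a random offset.

    Hypothetical helper for illustration only. Sequences no longer than
    `length` are returned unchanged (an assumption of this sketch).
    """
    if len(sequence) <= length:
        return sequence
    # Choose a start so that the fragment fits entirely inside the sequence.
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]


# Usage: draw a 200-bp fragment from a toy 800-bp sequence,
# with a seeded generator so the draw is reproducible.
toy_sequence = "ACGT" * 200
fragment = random_fragment(toy_sequence, 200, random.Random(0))
```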
Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments.