Borozan Ivan, Ferretti Vincent
Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, Canada.
Bioinformatics. 2016 Feb 1;32(3):453-5. doi: 10.1093/bioinformatics/btv587. Epub 2015 Oct 9.
Sequence comparison of genetic material between known and unknown organisms plays a crucial role in genomics, metagenomics and phylogenetic analysis. Emerging long-read sequencing technologies can now produce reads tens of kilobases in length, which promise a more accurate assessment of their origin. To facilitate the classification of long and short DNA sequences, we have developed a Python package that implements a new sequence classification model, which we have shown to improve classification accuracy compared with other state-of-the-art classification methods. To validate the combined sequence similarity score classifier (CSSSCL) and demonstrate its usefulness, we test it on three different datasets, including a metagenomic dataset composed of short reads.
The package's source code and test datasets are available under the GPLv3 license at https://github.com/oicr-ibc/cssscl.
Supplementary data are available at Bioinformatics online.