Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

Authors

Borozan Ivan, Watt Stuart, Ferretti Vincent

Affiliation

Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada.

Publication

Bioinformatics. 2015 May 1;31(9):1396-404. doi: 10.1093/bioinformatics/btv006. Epub 2015 Jan 7.

Abstract

MOTIVATION

Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized.

RESULTS

Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure than with another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences.
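
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of a per-sequence, adaptively weighted combination of similarity measures. Everything specific in it is an assumption made for illustration and is not the authors' method: the function names (`class_means`, `combined_scores`, `classify`), the min-max rescaling of each measure, and the choice of weight (the gap between the best and second-best class under a given measure, used here as a stand-in for that measure's "discriminatory ability" for one test sequence).

```python
from collections import defaultdict


def class_means(sims, labels):
    """Mean rescaled similarity of ONE test sequence to each training class,
    under a single similarity measure."""
    lo, hi = min(sims), max(sims)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    sums, counts = defaultdict(float), defaultdict(int)
    for s, lab in zip(sims, labels):
        sums[lab] += (s - lo) / span  # min-max rescale to [0, 1] (assumption)
        counts[lab] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def combined_scores(sims_by_measure, labels):
    """Combine several measures for one test sequence.

    sims_by_measure: one list per measure, each holding the similarity of the
                     test sequence to every training sequence.
    labels:          class label of every training sequence (>= 2 classes).
    Returns (weights, combined score per class).
    """
    means = [class_means(sims, labels) for sims in sims_by_measure]
    # Hypothetical weight: how clearly each measure separates its best class
    # from the runner-up for THIS test sequence.
    margins = []
    for m in means:
        ranked = sorted(m.values(), reverse=True)
        margins.append(ranked[0] - ranked[1])
    total = sum(margins)
    n = len(margins)
    # Fall back to equal weights if no measure discriminates at all.
    weights = [g / total for g in margins] if total > 0 else [1.0 / n] * n
    combined = {
        lab: sum(w * m[lab] for w, m in zip(weights, means))
        for lab in means[0]
    }
    return weights, combined


def classify(sims_by_measure, labels):
    """Assign the class with the highest combined similarity score."""
    _, combined = combined_scores(sims_by_measure, labels)
    return max(combined, key=combined.get)


# Toy data: two measures, four training sequences in two classes.
labels = ["A", "A", "B", "B"]
sims = [
    [0.9, 0.8, 0.1, 0.2],  # measure 1 separates the classes clearly
    [0.5, 0.5, 0.5, 0.6],  # measure 2 is nearly uninformative here
]
weights, combined = combined_scores(sims, labels)
prediction = classify(sims, labels)
```

Because the weights are recomputed from the similarities of each test sequence, two test sequences can end up with different weight sets, which is the property the paragraph above describes. In this toy case the first measure receives the larger weight and the sequence is assigned to class "A".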

AVAILABILITY AND IMPLEMENTATION

All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html.

CONTACT

ivan.borozan@gmail.com

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
