Div-BLAST: diversification of sequence search results.

Authors

Eser Elif, Can Tolga, Ferhatosmanoğlu Hakan

Affiliations

Department of Computer Engineering, Bilkent University, Ankara, Turkey.

Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.

Publication information

PLoS One. 2014 Dec 22;9(12):e115445. doi: 10.1371/journal.pone.0115445. eCollection 2014.

Abstract

Sequence similarity tools, such as BLAST, search a database for the sequences most similar to a query. The results they return are significantly similar to the query sequence and are typically also highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, in which the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by enforcing non-redundancy during database construction, but a database built this way cannot dynamically set a level of non-redundancy tailored to each query sequence. We introduce the problem of diverse search and browsing in sequence databases, which produces non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of the set of sequences returned by a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods achieve more diverse yet still significant result sets than static non-redundancy approaches. In both sequence-based and functional diversity evaluations, the proposed diversification methods significantly outperform the original BLAST results and other baselines. A web-based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST.
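The abstract does not spell out the diversification methods, so the following is only a minimal sketch of the general idea of post-processing a hit list for diversity, not the Div-BLAST algorithm itself. It assumes each hit carries a relevance score already normalized to [0, 1], and it uses k-mer Jaccard similarity as a stand-in for redundancy between hits. The function names (`kmer_set`, `jaccard`, `diversify`) and the `trade_off` parameter are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(
    hits: List[Tuple[str, str, float]],
    top_n: int,
    trade_off: float = 0.5,
    k: int = 3,
) -> List[str]:
    """Greedily re-rank hits, trading relevance against redundancy.

    hits: (hit_id, sequence, relevance) tuples, relevance assumed in [0, 1].
    trade_off: weight on relevance; (1 - trade_off) penalizes similarity
               to the hits already selected. Both are illustrative choices.
    """
    kmers: Dict[str, Set[str]] = {hid: kmer_set(seq, k) for hid, seq, _ in hits}
    relevance: Dict[str, float] = {hid: rel for hid, _, rel in hits}
    remaining = [hid for hid, _, _ in hits]
    selected: List[str] = []

    while remaining and len(selected) < top_n:
        def gain(hid: str) -> float:
            # Redundancy is the highest similarity to any hit chosen so far.
            redundancy = max(
                (jaccard(kmers[hid], kmers[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance[hid] - (1 - trade_off) * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)

    return selected
```

With this scoring, a hit that nearly duplicates an already selected sequence is pushed down the list even if its relevance is high, which is the behavior the abstract contrasts with plain BLAST output. Because the penalty is computed per query at search time, the level of non-redundancy adapts to each result set rather than being fixed when the database is built.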
