A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.

Publication information

Brief Bioinform. 2019 Jul 19;20(4):1222-1237. doi: 10.1093/bib/bbx161.

Abstract

MOTIVATION

Since the early days of bioinformatics, sequence alignment scores have been the main means of comparing sequences. However, alignment algorithms run in time quadratic in the sequence length, which leads to long execution times. As an alternative, scientists have developed dozens of alignment-free statistics for measuring the similarity between two sequences.

RESULTS

We surveyed dozens of alignment-free k-mer statistics. In addition, we evaluated 33 statistics, together with multiplicative combinations of the statistics and/or their squares. Each statistic is calculated on two k-mer histograms, one per sequence. Our evaluations against global alignment scores showed that the majority of the statistics are sensitive and can find sequences similar to a query sequence. Therefore, any of these statistics can quickly filter out dissimilar sequences. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Combinations involving the sequence length difference or the Earth Mover's distance, which takes the length difference into account, are always among the paired statistics most highly correlated with identity scores. Similarly, paired statistics that include the length difference or the Earth Mover's distance are among the best performers in finding the K closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, which reduces the memory requirement and increases the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists identify efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.
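To make the idea of histogram-based statistics concrete, the following is a minimal sketch rather than the authors' benchmarking tool. It assumes a DNA alphabet and uses the Manhattan and squared Euclidean distances purely as illustrative statistics; the abstract does not list the 33 statistics evaluated, so these may or may not be among them. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers and return the counts in a fixed k-mer order.

    k-mers containing characters outside `alphabet` (e.g. N) are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def manhattan(h1, h2):
    """Illustrative statistic: sum of absolute count differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def squared_euclidean(h1, h2):
    """Illustrative statistic: sum of squared count differences."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2))


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired_statistic(h1, h2):
    """Hypothetical multiplicative combination of two statistics."""
    return manhattan(h1, h2) * squared_euclidean(h1, h2)


def k_closest(query, database, n_neighbors, k=2):
    """Rank database sequences by Manhattan distance to the query histogram."""
    hq = kmer_histogram(query, k)
    ranked = sorted(database, key=lambda s: manhattan(hq, kmer_histogram(s, k)))
    return ranked[:n_neighbors]


if __name__ == "__main__":
    h_query = kmer_histogram("ACGTACGT", 2)
    h_other = kmer_histogram("ACGTACGA", 2)
    print(manhattan(h_query, h_other), paired_statistic(h_query, h_other))
    print(k_closest("ACGTACGT", ["TTTTTTTT", "ACGTACGA", "ACGTACGT"], 2))
```

Each histogram has 4^k entries for DNA, so a shorter word length k shrinks the histograms and the cost of every statistic, which is consistent with the memory and speed observation above. How the paper scales statistics before multiplying them is not stated in the abstract, so the raw product here is only an assumption.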

AVAILABILITY

The source code of the benchmarking tool is available as Supplementary Materials.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4510/6781583/c76abe8dc5f4/bbx161f1.jpg
