Integrative Omics, Pacific Northwest National Laboratory, Richland, 99352, WA, USA.
Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, 99352, WA, USA.
BMC Bioinformatics. 2018 Jun 11;19(1):221. doi: 10.1186/s12859-018-2210-6.
Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts that would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity, or specificity of such computations will have broad impact, as they allow scientists to explore the wealth of scientific data more completely.
The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm that enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to use computationally efficient bitwise operators and provide a coarse measure of similarity. We demonstrate the utility of our algorithm using two common bioinformatics tasks: identifying datasets with similar gene expression profiles, and comparing annotated genomes.
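The general idea of comparing binarized signatures with bit operators can be illustrated with a minimal Python sketch. This is not the authors' implementation: the fixed threshold used for binarization, the toy profiles, and the names `to_bits`, `shared_count`, and `pairwise_shared` are all assumptions made for illustration. Each profile is packed into a single integer, and the coarse similarity of two profiles is taken to be the number of positions set in both, computed with a bitwise AND followed by a population count.

```python
from itertools import combinations


def to_bits(values, threshold):
    """Pack a numeric profile into one int: bit i is set when values[i] >= threshold.

    The fixed threshold is an illustrative assumption, not the paper's rule.
    """
    bits = 0
    for i, v in enumerate(values):
        if v >= threshold:
            bits |= 1 << i
    return bits


def shared_count(a, b):
    """Count positions set in both signatures (bitwise AND, then popcount)."""
    return bin(a & b).count("1")


def pairwise_shared(signatures):
    """Return {(name_a, name_b): shared_count} for every unordered pair."""
    items = sorted(signatures.items())
    return {
        (name_a, name_b): shared_count(sig_a, sig_b)
        for (name_a, sig_a), (name_b, sig_b) in combinations(items, 2)
    }


# Hypothetical toy expression profiles (five genes per experiment).
profiles = {
    "exp1": [2.1, 0.1, 3.0, 0.0, 1.8],
    "exp2": [1.9, 0.2, 2.5, 1.7, 0.1],
    "exp3": [0.0, 2.2, 0.1, 0.3, 0.2],
}

signatures = {name: to_bits(vals, threshold=1.5) for name, vals in profiles.items()}
scores = pairwise_shared(signatures)

for pair, score in scores.items():
    print(pair, score)
```

In this toy example, `exp1` and `exp2` share two genes above the threshold, while `exp3` shares none with either. Because each comparison reduces to one AND and one bit count over packed integers, a coarse filter of this kind can cheaply discard dissimilar pairs before any more expensive similarity measure is applied.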
The BSF is a highly efficient pairwise similarity algorithm that can scale to billions of comparisons without the need for specialized hardware.