A parallel and incremental algorithm for efficient unique signature discovery on DNA databases.

Affiliation

Department of Computer Science and Communication Engineering, Providence University, Taichung, 43301 Taiwan, ROC.

Publication information

BMC Bioinformatics. 2010 Mar 16;11:132. doi: 10.1186/1471-2105-11-132.

DOI:10.1186/1471-2105-11-132

PMID:20230647

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2848650/

Abstract

BACKGROUND

DNA signatures are distinct short nucleotide sequences that are used for purposes such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures in DNA databases and then apply those signatures to microarray experiments. Such algorithms require the user to set input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, guidance on selecting proper factor values is rare, especially when an unfamiliar DNA database is used. In most cases, biologists select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, they change the input factors and run the algorithm again, repeating the process until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, including those that meet the user's requirements, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures.
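To make the roles of l and d concrete, the following is a minimal brute-force sketch in Python. It is not taken from the paper. It assumes that a length-l pattern counts as a unique signature when it differs from every other length-l window in the database in more than d positions (Hamming distance); the abstract does not spell out this rule. The names `hamming` and `unique_signatures` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def unique_signatures(database: list[str], l: int, d: int) -> set[str]:
    """Brute-force discovery under the assumed uniqueness rule."""
    # Collect every length-l window from every sequence in the database.
    windows = [s[i:i + l] for s in database for i in range(len(s) - l + 1)]
    found = set()
    for i, pattern in enumerate(windows):
        # Assumption: a signature has more than d mismatches
        # against every other window in the database.
        if all(hamming(pattern, other) > d
               for j, other in enumerate(windows) if j != i):
            found.add(pattern)
    return found


print(unique_signatures(["AAAT"], 2, 0))  # {'AT'}
print(unique_signatures(["AAAT"], 2, 1))  # set()
```

In this toy database, raising d from 0 to 1 removes the only signature, which illustrates why the choice of (l, d) changes the discovery result and why repeated trial runs are costly.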

RESULTS

This work proposes two discovery algorithms: the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed to discover signatures efficiently under a given discovery condition. It finds new results by using previously discovered results as candidates, rather than by scanning the whole database, and it further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine that is run under every feasible discovery condition.
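The idea of reusing earlier results as candidates can be sketched as follows. This is an illustration under stated assumptions, not the authors' PISD implementation. It assumes the Hamming-distance uniqueness rule (more than d mismatches against every other length-l window), under which any signature for tolerance d + 1 is also a signature for tolerance d, so only the previous survivors need rechecking. All function names are hypothetical. The thread pool only shows how candidates could be split among workers; in standard CPython, threads do not speed up CPU-bound loops, and a process pool would be the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def all_windows(database: list[str], l: int) -> list[str]:
    """Every length-l window from every sequence in the database."""
    return [s[i:i + l] for s in database for i in range(len(s) - l + 1)]


def is_signature(index: int, windows: list[str], d: int) -> bool:
    """Assumed rule: more than d mismatches against every other window."""
    pattern = windows[index]
    return all(hamming(pattern, other) > d
               for j, other in enumerate(windows) if j != index)


def discover(windows, d, candidates=None, workers=4):
    """Test only the candidate window indices (all windows by default)."""
    if candidates is None:
        candidates = range(len(windows))
    candidates = list(candidates)
    # Each candidate is checked independently, so the work can be shared.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(lambda i: is_signature(i, windows, d), candidates))
    return [i for i, ok in zip(candidates, flags) if ok]


def incremental_discover(database, l, d_start, d_max):
    """Discover signatures for d_start..d_max, reusing earlier survivors."""
    windows = all_windows(database, l)
    results = {}
    candidates = None  # first round checks every window
    for d in range(d_start, d_max + 1):
        candidates = discover(windows, d, candidates)
        results[d] = {windows[i] for i in candidates}
    return results


results = incremental_discover(["AAAT"], l=2, d_start=0, d_max=1)
print(results[0])  # {'AT'}
print(results[1])  # set()
```

Each candidate is still compared against all windows in the database. The saving comes only from shrinking the candidate list between rounds.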

CONCLUSIONS

The proposed algorithms discover implicit signatures efficiently. In experiments using eight processing cores, the CMD algorithm required up to 97% less execution time than typical sequential discovery algorithms to discover implicit signatures.
