Department of Computer Science and Communication Engineering, Providence University, Taichung, 43301 Taiwan, ROC.
BMC Bioinformatics. 2010 Mar 16;11:132. doi: 10.1186/1471-2105-11-132.
DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require to set some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length < or = l with mismatch tolerance > or = d. A discovery algorithm that could discover all implicit signatures, such that those that meet the requirements concerning the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures.
This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures efficiently under every feasible discovery condition.
The proposed algorithms discover implicit signatures efficiently. The presented CMD algorithm has up to 97% less execution time than typical sequential discovery algorithms in the discovery of implicit signatures in experiments, when eight processing cores are used.
DNA 特征是独特的短核苷酸序列,提供了有价值的信息,可用于各种目的,如聚合酶链反应引物和微阵列实验的设计。生物学家通常使用发现算法从 DNA 数据库中找到独特的特征,然后将这些特征应用于微阵列实验。此类发现算法需要设置一些输入因素,如特征长度 l 和错配容忍度 d,这些因素会影响发现结果。然而,关于如何选择合适的因子值的建议很少,尤其是在使用不熟悉的 DNA 数据库时。在大多数情况下,生物学家通常根据经验选择因子值,甚至是猜测。如果发现的结果不理想,生物学家会更改算法的输入因子以获得新的结果。这个过程会一直重复,直到得到合适的结果。在发现条件 (l, d) 下隐含的特征被定义为长度<或= l 且错配容忍度>或= d 的特征。一个能够发现所有隐含特征的发现算法,即那些满足结果要求的特征,会比依赖于试错的算法更有帮助。然而,现有的发现算法并没有解决发现所有隐含特征的需求。
本工作提出了两种发现算法 - 连续多次发现(CMD)算法和并行递增特征发现(PISD)算法。PISD 算法旨在有效地在特定发现条件下发现特征。该算法通过使用先前发现的结果作为候选,而不是使用整个数据库来找到新的结果。PISD 算法通过应用并行计算进一步提高了发现效率。CMD 算法旨在有效地发现隐含特征。它使用 PISD 算法作为核心例程,在每个可行的发现条件下高效地发现隐含特征。
所提出的算法有效地发现了隐含特征。在使用八个处理核的情况下,与典型的顺序发现算法相比,提出的 CMD 算法在实验中发现隐含特征的执行时间减少了 97%。