一种用于寻找k错配最短唯一子串的超快速并行算法。

Allen Daniel R, Thankachan Sharma V, Xu Bojian

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):138-148. doi: 10.1109/TCBB.2020.2968531. Epub 2021 Feb 3.

This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the any-case O(n) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.

本文重新审视了k错配最短唯一子串查找问题，并证明了最近在解决k错配平均公共子串问题的背景下提出的一种技术可以进行调整，并与现有解决方案的部分内容相结合，从而产生一种新算法，该算法的期望时间复杂度为O(n log n)，同时保持实际空间复杂度为O(kn)，其中n是字符串长度。当k = 时，这是一个困难的情况，我们的新方案显著提高了之前用于k错配最短唯一子串查找的最佳方法在任何情况下的O(n)时间复杂度。实验研究表明，我们的新算法易于实现，并且当k相对于n较小时，与之前最佳解决方案的实现相比，在处理时间上有显著改进。例如，我们的方法处理一个k = 1的200KB示例DNA序列仅需0.18秒，而之前的最佳解决方案则需要174.37秒。此外，我们观察到，采用两种不同的简单并发模型，可以并行执行该调整技术的大部分内容，从而进一步显著提高实际性能。例如，当使用8个核心时，在处理一个k = 2的10MB示例DNA序列时，并行实现的处理时间均不到串行实现时间成本的1/4。在一个通过云基础设施提供商可以轻松获得数千GB内存实例的时代，许多用户可能希望并需要用额外的内存使用来换取显著提高的处理时间。例如，之前的最佳解决方案可能需要数年时间才能完成任何k值下200MB的DNA样本处理，而我们的新方案使用24个核心，在峰值内存使用为46GB的情况下，只需206.376秒就能完成k = 1时该大小样本的处理，这在云上既容易获得又价格合理。预计这种用于k错配最短唯一子串查找的高效实用新算法将对计算生物学等领域中对长序列进行该测量的人员有用。我们还给出了一个理论界限，即k错配最短唯一子串查找问题可以在O(n log n)时间和O(n)空间内解决，渐近地比我们实现的算法要好得多，这是一个新的有趣发现。

相似文献

An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings.一种用于寻找k错配最短唯一子串的超快速并行算法。

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):138-148. doi: 10.1109/TCBB.2020.2968531. Epub 2021 Feb 3.

Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):386-395. doi: 10.1109/TCBB.2019.2935061. Epub 2021 Feb 4.

Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.快速精确算法求解最接近字符串和子字符串问题及其在 (L, d)-基序模型中的应用。

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.

A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.一种用于解决k错配平均公共子串问题的可证明高效算法。

J Comput Biol. 2016 Jun;23(6):472-82. doi: 10.1089/cmb.2015.0235. Epub 2016 Apr 8.

On the complexity of positional sequencing by hybridization.论基于杂交的位置测序的复杂性

J Comput Biol. 2001;8(4):361-71. doi: 10.1089/106652701752236188.

An O(N2) algorithm for discovering optimal Boolean pattern pairs.一种用于发现最优布尔模式对的O(N²)算法。

IEEE/ACM Trans Comput Biol Bioinform. 2004 Oct-Dec;1(4):159-70. doi: 10.1109/TCBB.2004.36.

Genome comparison without alignment using shortest unique substrings.使用最短唯一子串进行无需比对的基因组比较。

BMC Bioinformatics. 2005 May 23;6:123. doi: 10.1186/1471-2105-6-123.

Finding optimal threshold for correction error reads in DNA assembling.寻找DNA组装中校正错误读数的最佳阈值。

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S15. doi: 10.1186/1471-2105-10-S1-S15.

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.一种快速、无锁的方法，用于高效并行计数 k-mer 的出现次数。

Bioinformatics. 2011 Mar 15;27(6):764-70. doi: 10.1093/bioinformatics/btr011. Epub 2011 Jan 7.

Encoded expansion: an efficient algorithm to discover identical string motifs.编码扩展：一种发现相同字符串基序的高效算法。

PLoS One. 2014 May 28;9(5):e95148. doi: 10.1371/journal.pone.0095148. eCollection 2014.

Suppr 超能文献

核心技术专利：CN118964589B侵权必究

相似文献

An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings.一种用于寻找k错配最短唯一子串的超快速并行算法。

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):138-148. doi: 10.1109/TCBB.2020.2968531. Epub 2021 Feb 3.

Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):386-395. doi: 10.1109/TCBB.2019.2935061. Epub 2021 Feb 4.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.

A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.一种用于解决k错配平均公共子串问题的可证明高效算法。

J Comput Biol. 2016 Jun;23(6):472-82. doi: 10.1089/cmb.2015.0235. Epub 2016 Apr 8.

On the complexity of positional sequencing by hybridization.论基于杂交的位置测序的复杂性

J Comput Biol. 2001;8(4):361-71. doi: 10.1089/106652701752236188.

An O(N2) algorithm for discovering optimal Boolean pattern pairs.一种用于发现最优布尔模式对的O(N²)算法。

IEEE/ACM Trans Comput Biol Bioinform. 2004 Oct-Dec;1(4):159-70. doi: 10.1109/TCBB.2004.36.

Genome comparison without alignment using shortest unique substrings.使用最短唯一子串进行无需比对的基因组比较。

BMC Bioinformatics. 2005 May 23;6:123. doi: 10.1186/1471-2105-6-123.

Finding optimal threshold for correction error reads in DNA assembling.寻找DNA组装中校正错误读数的最佳阈值。

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S15. doi: 10.1186/1471-2105-10-S1-S15.

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.一种快速、无锁的方法，用于高效并行计数 k-mer 的出现次数。

Bioinformatics. 2011 Mar 15;27(6):764-70. doi: 10.1093/bioinformatics/btr011. Epub 2011 Jan 7.

Encoded expansion: an efficient algorithm to discover identical string motifs.编码扩展：一种发现相同字符串基序的高效算法。

PLoS One. 2014 May 28;9(5):e95148. doi: 10.1371/journal.pone.0095148. eCollection 2014.

An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings.

作者信息

出版信息

相似文献

文献检索

文件翻译

深度研究

Suppr 超能文献

相似文献