Allen Daniel R, Thankachan Sharma V, Xu Bojian
IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):138-148. doi: 10.1109/TCBB.2020.2968531. Epub 2021 Feb 3.
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the any-case O(n) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.
本文重新审视了k错配最短唯一子串查找问题,并证明了最近在解决k错配平均公共子串问题的背景下提出的一种技术可以进行调整,并与现有解决方案的部分内容相结合,从而产生一种新算法,该算法的期望时间复杂度为O(n log n),同时保持实际空间复杂度为O(kn),其中n是字符串长度。当k = 时,这是一个困难的情况,我们的新方案显著提高了之前用于k错配最短唯一子串查找的最佳方法在任何情况下的O(n)时间复杂度。实验研究表明,我们的新算法易于实现,并且当k相对于n较小时,与之前最佳解决方案的实现相比,在处理时间上有显著改进。例如,我们的方法处理一个k = 1的200KB示例DNA序列仅需0.18秒,而之前的最佳解决方案则需要174.37秒。此外,我们观察到,采用两种不同的简单并发模型,可以并行执行该调整技术的大部分内容,从而进一步显著提高实际性能。例如,当使用8个核心时,在处理一个k = 2的10MB示例DNA序列时,并行实现的处理时间均不到串行实现时间成本的1/4。在一个通过云基础设施提供商可以轻松获得数千GB内存实例的时代,许多用户可能希望并需要用额外的内存使用来换取显著提高的处理时间。例如,之前的最佳解决方案可能需要数年时间才能完成任何k值下200MB的DNA样本处理,而我们的新方案使用24个核心,在峰值内存使用为46GB的情况下,只需206.376秒就能完成k = 1时该大小样本的处理,这在云上既容易获得又价格合理。预计这种用于k错配最短唯一子串查找的高效实用新算法将对计算生物学等领域中对长序列进行该测量的人员有用。我们还给出了一个理论界限,即k错配最短唯一子串查找问题可以在O(n log n)时间和O(n)空间内解决,渐近地比我们实现的算法要好得多,这是一个新的有趣发现。