Department of Biology, BYU, Provo, UT 84602, USA.
Bioinformatics. 2017 Dec 15;33(24):3922-3928. doi: 10.1093/bioinformatics/btx538.
One of the main challenges for bioinformatics software is that the size and complexity of datasets force a trade-off between speed and accuracy or completeness. To cope with this computational complexity, many heuristic algorithms have arisen that report a 'good enough' solution to biological questions. However, for Simple Sequence Repeats (SSRs), a 'good enough' solution may misrepresent results in population genetics, phylogenetics and forensics, which require accurate SSRs to calculate intra- and inter-species interactions.
We present Kmer-SSR, a parallelized, easy-to-use tool that finds all SSRs faster than most heuristic SSR identification algorithms. The exhaustive Kmer-SSR option has 100% precision and 100% recall and identifies every SSR of any specified length. To identify more biologically pertinent SSRs, we also developed several filters that let users view a subset of SSRs based on user-supplied criteria. Coupled with these filters, Kmer-SSR identifies SSRs accurately and quickly, and in a more user-friendly manner than any other SSR identification algorithm.
The source code is freely available on GitHub at https://github.com/ridgelab/Kmer-SSR.
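To make concrete what identifying SSRs of a specified length involves, the following is a minimal Python sketch of a naive, non-heuristic scan. It is not the Kmer-SSR algorithm and is not taken from the Kmer-SSR repository. The names `SSR`, `is_primitive` and `find_ssrs`, and the parameters `min_period`, `max_period` and `min_repeats`, are hypothetical and chosen only for illustration. For each motif length it scans left to right, records each run of consecutive motif copies that meets the repeat threshold, and then skips past that run.

```python
from typing import List, NamedTuple


class SSR(NamedTuple):
    """One repeat run (hypothetical record type, for illustration only)."""
    start: int    # 0-based start position in the sequence
    motif: str    # repeated unit, e.g. "AC"
    repeats: int  # number of consecutive copies of the motif


def is_primitive(motif: str) -> bool:
    """Return True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq: str, min_period: int = 1, max_period: int = 6,
              min_repeats: int = 2) -> List[SSR]:
    """Naive scan for tandem repeats; an assumption-laden sketch, not Kmer-SSR."""
    results: List[SSR] = []
    n = len(seq)
    for p in range(min_period, max_period + 1):
        i = 0
        while i + p <= n:
            motif = seq[i:i + p]
            # Skip motifs such as "AA" that are already covered by a shorter period.
            if not is_primitive(motif):
                i += 1
                continue
            # Count consecutive copies of the motif starting at position i.
            repeats = 1
            while seq[i + repeats * p:i + (repeats + 1) * p] == motif:
                repeats += 1
            if repeats >= min_repeats:
                results.append(SSR(i, motif, repeats))
                i += repeats * p  # jump past the run just recorded
            else:
                i += 1
    return results


if __name__ == "__main__":
    for ssr in find_ssrs("GGACACACTTTTG", max_period=3, min_repeats=3):
        print(ssr)
```

Here `min_repeats` acts as a simple filter of the kind the abstract mentions: raising it keeps only longer runs. The scan visits every position for every motif length, so it trades speed for simplicity and does not reflect the performance claims made for Kmer-SSR.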