Department of Structural and Molecular Biology, UCL, Darwin Building, London, UK.
Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, Oxfordshire, UK.
Bioinformatics. 2019 May 15;35(10):1766-1767. doi: 10.1093/bioinformatics/bty863.
Many areas of bioinformatics require us to assign domain matches to stretches of a query protein. Starting from a set of candidate matches, we want to identify the optimal subset with limited or no overlap between matches. Discontinuous domains in the input data can complicate this further. Existing tools increasingly face very large datasets, for which they require prohibitive amounts of CPU time and memory.
We present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm, implemented in open-source C++, to handle large datasets quickly (up to ∼1 million hits/second) and with reasonable memory use. It accepts multiple input formats and writes its output as plain text, JSON or graphical HTML. A benchmark against an existing algorithm shows that CRH delivers very similar or slightly improved results, with substantially better CPU and memory performance on large datasets.
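To illustrate the kind of problem a dynamic-programming approach solves here, the following is a minimal Python sketch of weighted interval scheduling over candidate hits. It is not CRH's implementation: the `Hit` type, the `resolve_hits` function and the example scores are hypothetical, and the sketch assumes continuous (single-segment) hits, inclusive residue numbering and a strict no-overlap rule. It does not handle discontinuous domains or permitted partial overlaps.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """A hypothetical candidate match on the query sequence."""
    label: str
    start: int    # first residue covered (inclusive)
    stop: int     # last residue covered (inclusive)
    score: float  # higher is better


def resolve_hits(hits: List[Hit]) -> Tuple[float, List[Hit]]:
    """Return the best total score and the non-overlapping hits achieving it."""
    # Sort by stop so that every hit ending before hit i starts precedes it.
    ordered = sorted(hits, key=lambda h: (h.stop, h.start))
    stops = [h.stop for h in ordered]

    # best[i] is the best total score using only the first i sorted hits.
    best = [0.0] * (len(ordered) + 1)
    take = [False] * len(ordered)  # whether hit i is used in best[i + 1]
    prev = [0] * len(ordered)      # how many earlier hits end before hit i starts

    for i, hit in enumerate(ordered):
        # Count the earlier hits whose stop is strictly before this hit's start.
        p = bisect_right(stops, hit.start - 1, 0, i)
        prev[i] = p
        with_hit = best[p] + hit.score
        if with_hit > best[i]:
            best[i + 1] = with_hit
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen hits.
    chosen: List[Hit] = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[-1], chosen


if __name__ == "__main__":
    candidates = [
        Hit("a", 1, 50, 10.0),
        Hit("b", 40, 90, 25.0),
        Hit("c", 91, 150, 12.0),
        Hit("d", 60, 150, 20.0),
    ]
    total, chosen = resolve_hits(candidates)
    print(total, [h.label for h in chosen])
```

Sorting plus one binary search per hit gives O(n log n) time and O(n) memory, which is the general reason this family of algorithms scales to large hit sets. The throughput reported above is for CRH itself, not for this sketch.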
CRH is available at https://github.com/UCLOrengoGroup/cath-tools; documentation is available at http://cath-tools.readthedocs.io.
Supplementary data are available at Bioinformatics online.