cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly.

Author affiliations

Department of Structural and Molecular Biology, UCL, Darwin Building, London, UK.

Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, Oxfordshire, UK.

Publication information

Bioinformatics. 2019 May 15;35(10):1766-1767. doi: 10.1093/bioinformatics/bty863.

DOI:10.1093/bioinformatics/bty863

PMID:30295745

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6513158/

Abstract

MOTIVATION

Many areas of bioinformatics require us to assign domain matches onto stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset that has little or no overlap between matches. This may be further complicated by discontinuous domains in the input data. Existing tools increasingly face very large datasets, for which they require prohibitive amounts of CPU time and memory.
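To make the overlap problem concrete, the following is a minimal sketch of how a candidate match with one or more segments might be represented, and how the overlap between two matches could be counted. The `Hit` class, its field names and the example hits are hypothetical illustrations, not CRH's actual data model; residue positions are assumed to be inclusive, and the segments within a single hit are assumed not to overlap each other.

```python
from dataclasses import dataclass
from typing import Tuple

Segment = Tuple[int, int]  # inclusive (start, stop) residue positions


@dataclass(frozen=True)
class Hit:
    """Hypothetical candidate match on a query protein."""
    match_id: str
    score: float
    segments: Tuple[Segment, ...]  # one segment if continuous, several if discontinuous


def overlap_length(a: Hit, b: Hit) -> int:
    """Count the residues covered by both hits, summed over all segment pairs."""
    total = 0
    for a_start, a_stop in a.segments:
        for b_start, b_stop in b.segments:
            shared = min(a_stop, b_stop) - max(a_start, b_start) + 1
            if shared > 0:
                total += shared
    return total


# Made-up example hits on a single query protein.
outer = Hit("domain_A", 50.0, ((1, 40), (90, 130)))  # discontinuous domain
inner = Hit("domain_B", 30.0, ((45, 85),))           # sits in the gap of domain_A
clash = Hit("domain_C", 20.0, ((35, 50),))           # straddles both of the others

print(overlap_length(outer, inner))  # the two hits share no residues
print(overlap_length(outer, clash))
print(overlap_length(inner, clash))
```

The example shows why discontinuous domains complicate matters: `domain_B` lies entirely inside the span of `domain_A` yet shares no residues with it, so a resolver that only compared outermost start and stop positions would wrongly treat the two as conflicting.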

RESULTS

We present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm implemented in open-source C++ to handle large datasets quickly (up to ∼1 million hits/second) and in reasonable amounts of memory. It accepts multiple input formats and provides its output in plain text, JSON or graphical HTML. We describe a benchmark against an existing algorithm, which shows CRH delivers very similar or slightly improved results and very much improved CPU/memory performance on large datasets.
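As an illustration of the kind of dynamic programming involved, the sketch below solves the textbook weighted interval scheduling problem: choose the non-overlapping subset of hits with the highest total score. It is not CRH's implementation. It assumes continuous hits only, allows no overlap at all, and uses a hypothetical tuple layout `(match_id, start, stop, score)` with inclusive positions and made-up scores.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical layout: (match_id, start, stop, score), positions inclusive.
HitT = Tuple[str, int, int, float]


def resolve_hits(hits: List[HitT]) -> Tuple[float, List[HitT]]:
    """Return (best total score, chosen hits) such that no two chosen hits overlap."""
    ordered = sorted(hits, key=lambda h: h[2])  # sort by stop position
    stops = [h[2] for h in ordered]

    # best[i] = best total score achievable using only the first i hits of `ordered`
    best = [0.0] * (len(ordered) + 1)
    take = [False] * len(ordered)  # whether hit i is used in the optimum for best[i + 1]
    prev = [0] * len(ordered)      # how many earlier hits end strictly before hit i starts

    for i, (_, start, _, score) in enumerate(ordered):
        p = bisect_left(stops, start, 0, i)  # earlier hits with stop < start
        prev[i] = p
        with_hit = best[p] + score
        if with_hit > best[i]:
            best[i + 1] = with_hit
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen hits.
    chosen: List[HitT] = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[-1], chosen


# Made-up candidate hits for one query protein.
candidates = [
    ("A", 1, 100, 60.0),
    ("B", 1, 50, 40.0),
    ("C", 51, 100, 35.0),
    ("D", 90, 150, 30.0),
]
total, chosen = resolve_hits(candidates)
print(total, [h[0] for h in chosen])  # 75.0 ['B', 'C']
```

After sorting, each hit needs one binary search and one table update, so the whole procedure runs in O(n log n) time and O(n) memory. A greedy strategy that always took the single highest-scoring hit first would pick A (60) and miss the better combination of B and C (75), which is why an exact optimisation is worthwhile here.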

AVAILABILITY AND IMPLEMENTATION

CRH is available at https://github.com/UCLOrengoGroup/cath-tools; documentation is available at http://cath-tools.readthedocs.io.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d450/6513158/3ccdd8156bfa/bty863f1.jpg
