Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China.
Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae292.
Profile hidden Markov models (pHMMs) achieve high sensitivity in remote homology search, making them a popular choice for detecting novel or highly diverged viruses in metagenomic data. However, existing pHMM databases differ in their design focuses, which makes it difficult for users to choose the appropriate one. In this review, we provide a thorough evaluation and comparison of multiple commonly used pHMM databases for viral sequence discovery in metagenomic data. We characterize the databases by comparing their sizes, their taxonomic coverage, and the properties of their models using quantitative metrics. We then assess their performance in virus identification across multiple application scenarios, using both simulated and real metagenomic data. We aim to offer researchers a critical assessment of the strengths and limitations of the different databases. Furthermore, based on the experimental results from the simulated and real metagenomic data, we provide practical suggestions to help users optimize their use of pHMM databases, thus enhancing the quality and reliability of findings in viral metagenomics.