Shatadru Rokaiya Nurani, Solonenko Natalie E, Sun Christine L, Sullivan Matthew B
Department of Microbiology, The Ohio State University, Columbus, Ohio, United States of America.
Center of Microbiome Science, The Ohio State University, Columbus, Ohio, United States of America.
PLoS Biol. 2025 Nov 20;23(11):e3003510. doi: 10.1371/journal.pbio.3003510. eCollection 2025 Nov.
Microbiomes influence diverse ecosystems, and viruses increasingly appear to impose key constraints on them. While viromics has expanded genomic catalogs, identifying hosts for these viruses remains challenging because cultivation-based approaches scale poorly and in silico predictions have uncertain reliability and relatively low resolution, particularly for understudied viral taxa. To address this, Hi-C proximity ligation sequences cross-linked virus and host genomic fragments to infer virus-host linkages, and it has now been applied in at least 10 studies. However, its accuracy remains unknown. Here we assess Hi-C performance in recovering virus-host interactions using synthetic communities (SynComs) composed of four marine bacterial strains and nine phages with known interactions, and then apply optimized bioinformatic protocols to natural soil samples. In SynComs, standard Hi-C sample preparations and analyses based on normalized contact scores performed poorly (26% specificity, 100% sensitivity, incorrect matches up to the class level). Z-score filtering (Z ≥ 0.5) dramatically improved specificity to 99%, though at reduced sensitivity (62%, down from 100%). Reproducibility was poor below phage abundances of 10^5 PFU/mL, which establishes a practical detection limit. Applying the optimized bioinformatic protocols to natural soil samples, we compared virus-host linkages inferred from proximity-ligated Hi-C sequencing with predictions generated by in silico homology-based and machine learning-based bioinformatic approaches. Prior to Z-score thresholding, agreement was relatively high at the phylum to family levels (72%), but not at the genus (43%) or species (15%) levels. Z-score thresholding reduced sensitivity (only 34% of predictions were retained) and only modestly improved congruence with bioinformatic methods (48% and 18% at the genus and species levels, respectively). Regardless, this analysis yielded 79 genus-level-congruent virus-host linkages and 293 new linkages revealed by Hi-C alone, providing many new virus-host interactions to explore in already well-studied, climate-critical soils. Overall, these findings provide empirical benchmarks and methodological guidelines to improve the accuracy and reliability of Hi-C for virus-host linkage studies in complex microbial communities.
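The trade-off described above, where Z-score filtering raises specificity at some cost in sensitivity, can be illustrated with a minimal Python sketch. The abstract does not state how the Z-scores are computed, so this sketch assumes they are taken across the normalized contact scores of all candidate virus-host pairs in a sample. The function names, phage and host labels, and score values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of scores (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def filter_links(scores, z_min=0.5):
    """Keep virus-host pairs whose contact score has Z >= z_min.

    Assumption: Z-scores are computed across all candidate pairs at once.
    """
    pairs = list(scores)
    zs = zscores([scores[p] for p in pairs])
    return {p for p, z in zip(pairs, zs) if z >= z_min}


def sensitivity_specificity(predicted, true_links, all_pairs):
    """Compare predicted links against known links in a synthetic community."""
    negatives = all_pairs - true_links
    tp = len(predicted & true_links)
    fn = len(true_links - predicted)
    fp = len(predicted & negatives)
    tn = len(negatives - predicted)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical normalized contact scores for 2 phages x 3 hosts.
contact_scores = {
    ("p1", "hA"): 9.0, ("p1", "hB"): 1.0, ("p1", "hC"): 0.5,
    ("p2", "hA"): 0.8, ("p2", "hB"): 7.0, ("p2", "hC"): 1.2,
}
known_links = {("p1", "hA"), ("p2", "hB")}  # hypothetical ground truth
all_pairs = set(contact_scores)

# Without filtering, every pair with a nonzero score counts as a link.
unfiltered = {p for p, s in contact_scores.items() if s > 0}
filtered = filter_links(contact_scores, z_min=0.5)

print("unfiltered:", sensitivity_specificity(unfiltered, known_links, all_pairs))
print("filtered:  ", sensitivity_specificity(filtered, known_links, all_pairs))
```

In this toy case the filter removes every false link without losing a true one. In the SynCom benchmark reported above, some true links also fell below the threshold, which is why sensitivity dropped to 62%.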