Brait Nadja, Hackl Thomas, Morel Côme, Exbrayat Antoni, Gutierrez Serafin, Lequime Sebastian
Cluster of Microbial Ecology, Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen 9747 AG, The Netherlands.
ASTRE research unit, Cirad, INRAe, Université de Montpellier, Montpellier 34398, France.
Virus Evol. 2023 Dec 28;10(1):vead088. doi: 10.1093/ve/vead088. eCollection 2024.
Large-scale metagenomic and metatranscriptomic studies have revolutionized our understanding of viral diversity and abundance. In contrast, endogenous viral elements (EVEs), remnants of viral sequences integrated into host genomes, have received limited attention in the context of virus discovery, especially in RNA-Seq data. EVEs resemble their original viruses, which makes it difficult to distinguish active infections from integrated remnants, affects virus classification, and biases downstream analyses. Here, we systematically assess the effects of EVEs on a prototypical virus discovery pipeline, evaluate their impact on data integrity and classification accuracy, and provide recommendations for better practice. We examined EVEs and exogenous viral sequences linked to , a diverse family of negative-sense segmented RNA viruses, in 13 genomic and 538 transcriptomic datasets of Culicinae mosquitoes. Our analysis revealed a substantial number of viral sequences in transcriptomic datasets. However, a significant portion appeared not to be exogenous viruses but transcripts derived from EVEs. Distinguishing between transcribed EVEs and exogenous virus sequences was especially difficult in samples with low viral abundance. For example, three transcribed EVEs showed full-length segments, devoid of frameshift and nonsense mutations, with mean read depths sufficient to qualify them as exogenous virus hits. Mapping reads to a host genome containing EVEs before assembly somewhat alleviated the EVE burden, but it led to a drastic reduction of viral hits and reduced the quality of assemblies, especially in regions of the viral genome relatively similar to EVEs. Our study highlights that our knowledge of the genetic diversity of viruses can be altered by the underestimated presence of EVEs in transcriptomic datasets, leading to false positives and altered or missing sequence information. Thus, recognizing and addressing the influence of EVEs in virus discovery pipelines will be key to enhancing our ability to capture the full spectrum of viral diversity.
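The point about intact transcribed EVEs can be illustrated with a minimal sketch. It is not the pipeline used in the study: the `Contig` fields, the `passes_naive_filter` function, the thresholds, and all example values are hypothetical. It only shows how a filter built on segment completeness, absence of frameshift and nonsense mutations, and mean read depth accepts an intact, transcribed EVE alongside a genuine exogenous virus.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Hypothetical summary of one assembled viral-like contig."""
    name: str
    length: int             # assembled contig length (nt)
    reference_length: int   # length of the closest reference segment (nt)
    has_frameshift: bool    # frameshift mutation detected in the ORF
    has_nonsense: bool      # premature stop codon detected in the ORF
    mean_depth: float       # mean read depth across the contig


def passes_naive_filter(c: Contig,
                        min_coverage: float = 0.9,
                        min_depth: float = 10.0) -> bool:
    """Return True if the contig looks like an exogenous virus hit.

    The thresholds are illustrative assumptions, not values from the study.
    """
    full_length = c.length >= min_coverage * c.reference_length
    intact = not (c.has_frameshift or c.has_nonsense)
    return full_length and intact and c.mean_depth >= min_depth


# Made-up example contigs (all values are invented for illustration).
exogenous = Contig("exogenous_segment", 2300, 2340, False, False, 250.0)
degraded_eve = Contig("degraded_eve", 900, 2340, True, False, 12.0)
intact_eve = Contig("intact_transcribed_eve", 2320, 2340, False, False, 15.0)

results = {c.name: passes_naive_filter(c)
           for c in (exogenous, degraded_eve, intact_eve)}
print(results)
```

In this sketch the degraded EVE is rejected, but the intact transcribed EVE passes every criterion. Sequence-level filters of this kind therefore cannot replace evidence from the host genome, such as checking whether the contig and its flanking regions are present in a genome assembly.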