Kitsou Konstantina, Katzourakis Aris, Magiorkinis Gkikas
Department of Hygiene, Epidemiology and Medical Statistics, National and Kapodistrian University of Athens, Athens 11527, Greece.
Department of Zoology, University of Oxford, Oxford OX1 4BH, UK.
NAR Genom Bioinform. 2024 Jul 9;6(3):lqae081. doi: 10.1093/nargab/lqae081. eCollection 2024 Sep.
Human endogenous retroviruses (HERVs), the remnants of ancient germline retroviral integrations, comprise almost 8% of the human genome. The elucidation of their biological roles is hampered by our inability to link HERV mRNA and protein production with specific HERV loci. To resolve the integration-specific RNA expression of HERVs, several bioinformatics approaches have been proposed; however, no single approach appears to yield optimal results, owing to the repetitiveness of HERV integrations. The performance of existing bioinformatics pipelines has been evaluated against real-world datasets whose true expression profile is unknown, so the accuracy of widely used approaches remains unclear. Here, we simulated mRNA production from specific HERV integrations to evaluate second- and third-generation sequencing technologies, along with widely used bioinformatic approaches, and to estimate their accuracy in describing integration-specific expression. We demonstrate that, while a HERV-family approach offers accurate results, per-integration analyses of HERV expression suffer from substantial expression bias. This bias is only partially mitigated by algorithms developed for calculating per-integration HERV expression, and it is more pronounced in recent integrations. Hence, this bias could produce erroneous inferences that appear biologically meaningful. Finally, we demonstrate the merits of accurate long-read high-throughput sequencing technologies in the resolution of per-locus HERV expression.
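The contrast between family-level accuracy and per-integration bias can be illustrated with a toy calculation. The sketch below is not the authors' simulation or any published pipeline; it assumes a hypothetical family of three loci, a hypothetical `ambiguous_fraction` of reads that map equally well to every locus, and a naive rule that splits those reads evenly. The function name, locus names and all numbers are invented for illustration.

```python
def naive_locus_counts(true_counts, ambiguous_fraction):
    """Estimate per-locus read counts under a naive even-split rule.

    true_counts: dict mapping locus name -> reads truly produced by that locus
    ambiguous_fraction: share of reads (0..1) assumed to map equally well
        to every locus of the family (a stand-in for sequence repetitiveness)
    """
    n_loci = len(true_counts)
    estimated = {locus: 0.0 for locus in true_counts}
    for locus, count in true_counts.items():
        # Reads that carry locus-specific differences are assigned correctly.
        estimated[locus] += count * (1.0 - ambiguous_fraction)
        # Ambiguous reads are shared evenly among all loci of the family.
        shared = count * ambiguous_fraction / n_loci
        for other in estimated:
            estimated[other] += shared
    return estimated


if __name__ == "__main__":
    # Hypothetical ground truth: locus_B is transcriptionally silent.
    truth = {"locus_A": 100, "locus_B": 0, "locus_C": 50}
    estimate = naive_locus_counts(truth, ambiguous_fraction=0.6)
    for locus in truth:
        print(locus, "true:", truth[locus], "estimated:", round(estimate[locus], 2))
    # The family total is preserved even though individual loci are biased.
    print("family true:", sum(truth.values()),
          "family estimated:", round(sum(estimate.values()), 2))
```

Under these assumptions the silent locus receives a non-zero estimate while the family total is unchanged, which mirrors the qualitative pattern described in the abstract. A higher `ambiguous_fraction` loosely corresponds to more recent integrations, whose copies have diverged less from one another.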