Podvalnyi Artem, Kopernik Arina, Sayganova Mariia, Woroncow Mary, Zobkova Gauhar, Smirnova Anna, Esibov Anton, Deviatkin Andrey, Volchkov Pavel, Albert Eugene
Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia.
Faculty of Computer Science, HSE University, 101000 Moscow, Russia.
Int J Mol Sci. 2025 Jan 3;26(1):363. doi: 10.3390/ijms26010363.
A pseudogene is a non-functional copy of a protein-coding gene. Processed pseudogenes, a major pseudogene class, arise when mRNA is reverse-transcribed and the resulting cDNA is integrated into the genome. They pose a significant challenge in genome analysis because they are highly similar in sequence to their parent genes and are frequently absent from the reference genome. This homology can cause errors in variant identification: sequences derived from processed pseudogenes can be incorrectly assigned to the parent genes, which complicates correct variant calling. In this study, we quantified the pseudogene-associated variant calling errors produced by the most popular germline variant callers, namely GATK-HC, DRAGEN, and DeepVariant, when analysing 30x human whole-genome sequencing data (n = 13,307). The results show that the presence of pseudogenes can interfere with variant calling, leading to false-positive identification of potentially clinically relevant variants. Compared to the other approaches, DeepVariant was the most effective in correcting these errors.