Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Mol Biol Evol. 2021 Jun 25;38(7):2958-2966. doi: 10.1093/molbev/msab062.
LINE-1-mediated retrotransposition of protein-coding mRNAs is an active process in modern humans, in both germline and somatic genomes. Prior studies that surveyed human data mostly relied on detecting discordant mappings of paired-end short reads or exon junctions contained in short reads. Moreover, there have been few genome-wide comparisons of gene retrocopies between great apes and humans. In this study, we introduce a more sensitive and accurate method to identify processed pseudogenes. Our method uses long-read assemblies and, more importantly, provides full-length retrocopy sequences as well as flanking regions, both of which are missed by short-read-based methods. From 22 human individuals, we pinpointed 40 processed pseudogenes that are not present in the human reference genome GRCh38 and identified 17 pseudogenes that are in GRCh38 but absent from some input individuals. This represents a substantially higher discovery rate than previous reports (39 pseudogenes not in the reference genome out of 939 individuals). We also provide an overview of lineage-specific retrocopies in the chimpanzee, gorilla, and orangutan genomes.
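The discovery-rate comparison can be made concrete with a short calculation. The sketch below assumes that "discovery rate" means non-reference processed pseudogenes found per individual surveyed, which is one plausible reading of the abstract rather than a definition it gives. The function name `discoveries_per_individual` is hypothetical, and the only inputs are the counts quoted above.

```python
def discoveries_per_individual(n_pseudogenes: int, n_individuals: int) -> float:
    """Non-reference processed pseudogenes found per individual surveyed.

    Assumption: this per-individual ratio is what "discovery rate" refers to.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return n_pseudogenes / n_individuals


# Counts quoted in the abstract.
this_study = discoveries_per_individual(40, 22)    # long-read assemblies
previous = discoveries_per_individual(39, 939)     # earlier reports

fold_change = this_study / previous

print(f"this study: {this_study:.2f} per individual")
print(f"previous:   {previous:.3f} per individual")
print(f"fold change: {fold_change:.1f}x")
```

A per-individual ratio is a crude measure: the number of distinct polymorphic insertions does not grow linearly with cohort size, and the two cohorts may differ in ancestry and sequencing depth, so the fold change should be read as indicative only.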