Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA.
Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA.
Nucleic Acids Res. 2020 Feb 20;48(3):1146-1163. doi: 10.1093/nar/gkz1173.
Long Interspersed Element-1 (LINE-1) retrotransposition contributes to inter- and intra-individual genetic variation and can occasionally lead to human genetic disorders. Various strategies have been developed to identify human-specific LINE-1 (L1Hs) insertions from short-read whole-genome sequencing (WGS) data; however, they have limitations in detecting insertions in complex repetitive genomic regions. Here, we developed a computational tool (PALMER) and used it to identify 203 non-reference L1Hs insertions in the NA12878 benchmark genome. Using PacBio long-read sequencing data, we identified L1Hs insertions that were absent in previous short-read studies (90/203). Approximately 81% (73/90) of these L1Hs insertions reside within endogenous LINE-1 sequences in the reference assembly, and analysis of unique breakpoint junction sequences revealed that 63% (57/90) of them could be genotyped in 1000 Genomes Project sequences. Moreover, we observed that amplification biases encountered in single-cell WGS experiments led to wide variation in L1Hs insertion detection rates among four individual NA12878 cells; under-amplification limited detection to 32% (65/203) of insertions, whereas over-amplification increased false-positive calls. In sum, these data indicate that L1Hs insertions are often missed by standard short-read sequencing approaches and that long-read sequencing approaches can significantly improve the detection of L1Hs insertions present in individual genomes.
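The proportions quoted in the abstract can be recomputed from the reported counts. The following is a minimal sketch for checking that arithmetic only; the helper name `percent` and the dictionary labels are illustrative assumptions, not part of PALMER or the published analysis.

```python
def percent(numerator: int, denominator: int) -> int:
    """Return numerator/denominator as a percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


# Counts as reported in the abstract; the labels are hypothetical, for illustration.
reported_counts = {
    "missed by previous short-read studies": (90, 203),
    "within reference LINE-1 sequences": (73, 90),
    "genotypable in 1000 Genomes Project data": (57, 90),
    "detected under single-cell under-amplification": (65, 203),
}

for label, (hits, total) in reported_counts.items():
    print(f"{label}: {hits}/{total} = {percent(hits, total)}%")
```

The abstract gives 90/203 only as a fraction; the percentage printed for that row is derived here and is not a figure stated in the source.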