Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex lipoprotein(a) KIV-2 VNTR.

BACKGROUND: Repetitive genome regions, such as variable number tandem repeats (VNTRs) or short tandem repeats (STRs), are major constituents of the uncharted dark genome and evade conventional sequencing approaches. The protein-coding LPA kringle IV type-2 (KIV-2) VNTR (5.6 kb per unit, 1-40 units per allele) is a medically highly relevant example with a particularly intricate structure, multiple haplotypes, intragenic homologies, and an intra-VNTR STR. It is the primary regulator of plasma lipoprotein(a) [Lp(a)] concentrations, an important cardiovascular risk factor. Lp(a) concentrations vary widely between individuals and ancestries. Multiple variants and functional haplotypes in the LPA gene, and especially in the KIV-2 VNTR, strongly contribute to this variance.

METHODS: We evaluated the performance of amplicon-based nanopore sequencing with unique molecular identifiers (UMI-ONT-Seq) for SNP detection, haplotype mapping, VNTR unit consensus sequence generation, and copy number estimation via coverage-corrected haplotype quantification in the KIV-2 VNTR. We used 15 human samples and low-level mixtures (0.5 to 5%) of KIV-2 plasmids as a validation set. We then applied UMI-ONT-Seq to extract KIV-2 VNTR haplotypes in 48 multi-ancestry 1000 Genomes samples and analyzed a poorly characterized STR within the KIV-2 VNTR at scale.

RESULTS: UMI-ONT-Seq detected KIV-2 SNPs down to 1% variant level with high sensitivity, specificity, and precision (0.977 ± 0.018; 1.000 ± 0.0005; 0.993 ± 0.02) and accurately retrieved the full-length haplotype of each VNTR unit. Human variant levels were highly correlated with next-generation sequencing (R = 0.983) without bias across the whole variant level range. Six reads per UMI produced sequences of each KIV-2 unit with Q40 quality. The KIV-2 repeat number determined by coverage-corrected unique haplotype counting was in close agreement with droplet digital PCR (ddPCR), with 70% of the samples falling within even the narrow confidence interval of ddPCR. We then analyzed 62,679 intra-KIV-2 STR sequences and explored KIV-2 SNP haplotype patterns across five ancestries.

CONCLUSIONS: UMI-ONT-Seq accurately retrieves the SNP haplotype and precisely quantifies the VNTR copy number of each repeat unit of the complex KIV-2 VNTR region across multiple ancestries. Using the KIV-2 VNTR, this study presents a novel and potent tool for comprehensive characterization of medically relevant complex genome regions at scale.
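Two of the figures quoted in RESULTS rest on standard definitions that can be written down briefly. The sketch below is not the authors' pipeline: the function names and the example counts are hypothetical and chosen only for illustration. It shows how sensitivity, specificity, and precision are derived from confusion-matrix counts, and what error probability a Phred score such as Q40 corresponds to.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion counts.

    tp/fp/tn/fn are counts of true positive, false positive, true negative,
    and false negative variant calls. Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)   # fraction of true variants that were called
    specificity = tn / (tn + fp)   # fraction of non-variant sites left uncalled
    precision = tp / (tp + fp)     # fraction of calls that are true variants
    return sensitivity, specificity, precision


def phred_to_error_probability(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Illustrative counts only; these are NOT data from the study.
    sens, spec, prec = classification_metrics(tp=8, fp=2, tn=88, fn=2)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} precision={prec:.3f}")
    # Q40 corresponds to roughly one expected error per 10,000 bases.
    print(f"Q40 error probability = {phred_to_error_probability(40):.0e}")
```

Under these definitions, Q40 consensus quality means an expected error rate of about 1 in 10,000 bases for each KIV-2 unit sequence.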
