SVJedi: genotyping structural variations with long reads.

Author information

Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France.

Publication information

Bioinformatics. 2020 Nov 1;36(17):4568-4575. doi: 10.1093/bioinformatics/btaa527.

DOI:10.1093/bioinformatics/btaa527

PMID:32437523

Abstract

MOTIVATION

Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnosis, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies.

RESULTS

We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative sequences for the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only the informative ones, which are used to quantify the presence of each SV allele and to estimate the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools, and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches.
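To make the idea concrete, the following is a minimal sketch of the two steps described above for the simplest case, a deletion: building the two allele sequences, and calling a genotype from the number of informative reads supporting each allele. It is not SVJedi's actual code. The function names, the flank size, the 5% error rate, and the simple binomial likelihood model are all assumptions chosen for illustration, and the alignment and filtering steps are left out.

```python
from math import comb


def deletion_alleles(ref, start, end, flank):
    """Build (reference_allele, alternative_allele) for a deletion of ref[start:end].

    Hypothetical helper: both alleles share the same flanking sequence;
    the alternative allele simply lacks the deleted segment.
    """
    left = ref[max(0, start - flank):start]
    right = ref[end:end + flank]
    ref_allele = left + ref[start:end] + right
    alt_allele = left + right
    return ref_allele, alt_allele


def genotype(ref_reads, alt_reads, err=0.05):
    """Pick the most likely genotype from informative read counts.

    Assumed model: each informative read supports the alternative allele
    with probability err (0/0), 0.5 (0/1), or 1 - err (1/1).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return "./."  # no informative read: genotype left uncalled
    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}
    likelihood = {
        g: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for g, p in p_alt.items()
    }
    return max(likelihood, key=likelihood.get)


if __name__ == "__main__":
    print(deletion_alleles("AAACCCGGGTTT", 3, 6, 2))
    print(genotype(10, 0), genotype(5, 5), genotype(0, 8))
```

In practice the read counts would come from the filtered alignments of long reads against the two allele sequences; here they are passed in directly to keep the sketch self-contained.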

AVAILABILITY AND IMPLEMENTATION

https://github.com/llecompte/SVJedi.git.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
