Department of Computer Engineering, Bilkent University, Ankara, Turkey.
Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey.
Bioinformatics. 2019 Oct 15;35(20):3923-3930. doi: 10.1093/bioinformatics/btz237.
Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. However, complex SVs are of crucial importance, and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, because their sequencing signatures are similar, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms that fully characterize complex SVs and thereby improve the calling accuracy of simpler variants.
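To illustrate why these events are easy to confuse, the following minimal Python sketch labels a read pair from its orientation and mapped span alone. It is not the TARDIS implementation: the `ReadPair` type, the `naive_signature` function and the span thresholds are hypothetical, and a standard forward-reverse paired-end library is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """A mapped read pair (hypothetical structure, for illustration only)."""
    left_strand: str   # strand of the leftmost-mapped read: "+" or "-"
    right_strand: str  # strand of the rightmost-mapped read: "+" or "-"
    span: int          # reference distance covered by the pair, in bp


def naive_signature(pair: ReadPair, min_span: int = 200, max_span: int = 600) -> str:
    """Label a pair using orientation and span only.

    min_span and max_span are assumed bounds on the concordant fragment size.
    """
    if pair.left_strand == pair.right_strand:
        # +/+ or -/-: produced by inversions, but also by inverted duplications.
        return "inversion-like"
    if pair.left_strand == "-" and pair.right_strand == "+":
        # Everted pair: the classic tandem duplication signature.
        return "tandem-duplication-like"
    # Remaining case is the expected +/- orientation; only the span can differ.
    if pair.span > max_span:
        # Overlong span: deletions, but also direct interspersed duplications.
        return "deletion-like"
    if pair.span < min_span:
        return "insertion-like"
    return "concordant"


if __name__ == "__main__":
    print(naive_signature(ReadPair("+", "+", 5000)))
    print(naive_signature(ReadPair("+", "-", 5000)))
```

Because the labels depend only on strand and span, a same-strand pair created by an inverted duplication is indistinguishable here from one created by a simple inversion. Separating the two would require additional evidence, such as the read depth and split read signatures mentioned in this abstract.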
We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short-read whole-genome sequencing datasets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, at 30× coverage TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate for our predictions of tandem, direct and inverted interspersed segmental duplications on CHM1 (<5% for the top 50 predictions).
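For readers less familiar with the two metrics quoted above, the short Python sketch below shows how sensitivity and false discovery rate are conventionally computed from counts of true positives, false positives and false negatives. The function names and the counts in the usage example are hypothetical, chosen only to reproduce the 96% and 4% figures; they are not the actual counts from the simulation experiments.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that were called: TP / (TP + FN)."""
    total_real = true_pos + false_neg
    if total_real == 0:
        raise ValueError("no real variants to evaluate against")
    return true_pos / total_real


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of calls that are wrong: FP / (TP + FP)."""
    total_calls = true_pos + false_pos
    if total_calls == 0:
        raise ValueError("no calls were made")
    return false_pos / total_calls


if __name__ == "__main__":
    # Hypothetical counts: 100 simulated variants, 100 calls, 96 of them correct.
    tp, fp, fn = 96, 4, 4
    print(f"sensitivity = {sensitivity(tp, fn):.2%}")
    print(f"FDR         = {false_discovery_rate(tp, fp):.2%}")
```

Note that the false discovery rate is measured against the set of calls rather than the set of true negatives, so it differs from the false positive rate.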
TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/.
Supplementary data are available at Bioinformatics online.