School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; Genome Institute, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China; Leiden Institute of Advanced Computer Science, Faculty of Science, Leiden University, Leiden 2311EZ, Netherlands.
MOE Key Lab for Intelligent Networks & Networks Security, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China; School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Genomics Proteomics Bioinformatics. 2022 Feb;20(1):205-218. doi: 10.1016/j.gpb.2021.03.007. Epub 2021 Jul 3.
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are regarded as the simultaneous occurrence of several simple structural variants. However, the compounded mutational signals of CSVs are difficult to detect with the commonly used model-match strategy. As a result, progress in CSV discovery has been limited compared with that for simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs and proposed Mako, which uses a bottom-up guided, model-free strategy to detect CSVs from paired-end short-read sequencing data. Specifically, we implemented a graph-based pattern growth approach, in which the graph depicts potential breakpoint connections and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets showed that Mako outperformed other algorithms. Notably, the validation rates of CSVs on real data, based on experimental and computational validations as well as manual inspections, are around 70%, and the median experimental and computational breakpoint shifts are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types: adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.
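The idea of treating breakpoints as graph nodes and growing candidate events along their connections can be illustrated with a toy example. The sketch below is not Mako's implementation: it assumes breakpoints are given as integer positions and connections as position pairs, and it substitutes plain connected-component growth for the pattern growth procedure of the paper. The names `build_signal_graph`, `grow_subgraphs`, and `min_breakpoints` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_signal_graph(edges):
    """Build an undirected graph from (breakpoint_a, breakpoint_b) pairs.

    Each pair stands for a potential connection between two breakpoints,
    for example one supported by discordant read pairs (an assumption of
    this sketch, not a description of Mako's input).
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def grow_subgraphs(graph, min_breakpoints=3):
    """Grow a subgraph from each unvisited seed by following connections.

    Subgraphs with at least `min_breakpoints` nodes are reported as CSV
    candidates, reflecting the "more than two breakpoints" definition.
    """
    seen = set()
    candidates = []
    for seed in sorted(graph):
        if seed in seen:
            continue
        component = {seed}
        frontier = [seed]
        while frontier:
            node = frontier.pop()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    frontier.append(neighbor)
        seen |= component
        if len(component) >= min_breakpoints:
            candidates.append(sorted(component))
    return candidates


if __name__ == "__main__":
    # Four linked breakpoints form one candidate; the isolated pair
    # (5000, 5400) has only two breakpoints and is treated as simple.
    connections = [(100, 500), (500, 900), (900, 1300), (5000, 5400)]
    print(grow_subgraphs(build_signal_graph(connections)))
```

No event model (such as "deletion flanked by inversion") appears anywhere in the sketch; candidates emerge only from how breakpoints are connected, which is the sense in which such a strategy is model-free.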