Yuan Xiguo, Gao Meihong, Bai Jun, Duan Junbo
IEEE/ACM Trans Comput Biol Bioinform. 2020 May-Jun;17(3):1082-1091. doi: 10.1109/TCBB.2018.2876527. Epub 2018 Oct 17.
Structural variation accounts for a major fraction of mutations in the human genome and confers susceptibility to complex diseases. Next-generation sequencing, together with the rapid development of computational methods, provides a cost-effective way to detect such variations. Simulating structural variations and sequencing reads with realistic characteristics is essential for benchmarking these computational methods. Here, we develop a new program, SVSR, to simulate five types of structural variations (indels, tandem duplications, CNVs, inversions, and translocations) and SNPs for the human genome, and to generate sequencing reads with features of popular platforms (Illumina, SOLiD, 454, and Ion Torrent). We adopt a selection model trained on real data to predict copy number states, proceeding from the first site of a given genome to the last. In addition, we use reference sequences of microbial genomes to produce insertion fragments, and we design probabilistic models to imitate inversions and translocations. We also create platform-specific error and base quality profiles to generate normal, tumor, or normal-tumor mixture reads. Experimental results show that SVSR captures more realistic features and generates datasets with satisfactory quality scores. SVSR can be used to evaluate the performance of structural variation detection methods and to guide the development of new computational methods.
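To make the kinds of operations named in the abstract concrete, the following is a minimal Python sketch of how an inversion, a tandem duplication, and a normal-tumor read mixture might be simulated on a toy sequence. It is not SVSR's implementation: every function name and parameter (`apply_inversion`, `apply_tandem_duplication`, `mixture_reads`, `purity`) is hypothetical, read start positions are assumed to be uniform, and the trained selection model and the platform-specific error and quality profiles described above are left out.

```python
import random

# Complement table for DNA bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_inversion(ref: str, start: int, end: int) -> str:
    """Replace ref[start:end] with its reverse complement."""
    return ref[:start] + reverse_complement(ref[start:end]) + ref[end:]


def apply_tandem_duplication(ref: str, start: int, end: int, copies: int = 2) -> str:
    """Repeat ref[start:end] back to back so that it occurs `copies` times."""
    return ref[:start] + ref[start:end] * copies + ref[end:]


def sample_reads(genome: str, read_len: int, n_reads: int, rng: random.Random) -> list:
    """Draw error-free reads at uniformly chosen start positions (an assumption)."""
    last_start = len(genome) - read_len
    return [
        genome[pos:pos + read_len]
        for pos in (rng.randint(0, last_start) for _ in range(n_reads))
    ]


def mixture_reads(normal: str, tumor: str, purity: float,
                  read_len: int, n_reads: int, rng: random.Random) -> list:
    """Mix reads so that roughly a `purity` fraction comes from the tumor genome."""
    n_tumor = round(purity * n_reads)
    reads = sample_reads(tumor, read_len, n_tumor, rng)
    reads += sample_reads(normal, read_len, n_reads - n_tumor, rng)
    rng.shuffle(reads)
    return reads


if __name__ == "__main__":
    rng = random.Random(0)
    normal = "AAACGTTT" * 10
    # Hypothetical tumor genome: one tandem duplication followed by one inversion.
    tumor = apply_inversion(apply_tandem_duplication(normal, 8, 16), 30, 40)
    for read in mixture_reads(normal, tumor, purity=0.3, read_len=12, n_reads=5, rng=rng):
        print(read)
```

A fuller simulator would additionally need a per-platform error and base quality model and a data-driven choice of variant positions and copy number states, which this sketch does not attempt.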