Eitan Rami, Shamir Ron
Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv-Yafo, Israel.
BMC Bioinformatics. 2017 Nov 15;18(1):488. doi: 10.1186/s12859-017-1929-9.
During cancer progression, genomes undergo point mutations as well as larger segmental changes. The latter include, among others, segmental deletions, duplications, translocations and inversions. The result is a highly complex, patient-specific cancer karyotype. Using the high-throughput technologies of deep sequencing and microarrays, it is possible to interrogate a cancer genome and produce chromosomal copy number profiles and a list of breakpoints ("jumps") relative to the normal genome. This information is very detailed but local, and does not give the overall picture of the cancer genome. One of the basic challenges in cancer genome research is to use such information to infer the cancer karyotype. We present here an algorithmic approach, based on graph theory and integer linear programming, that receives segmental copy number and breakpoint data as input and produces a cancer karyotype that is most concordant with them. We used simulations to evaluate the utility of our approach, and applied it to real data.
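To make the input and output of such an approach concrete, the following is a minimal sketch and not the authors' formulation. It assumes a simplified model in which each genomic segment has a tail ("t") and a head ("h") end, each candidate adjacency (a reference adjacency or a breakpoint) receives a non-negative integer multiplicity, and no segment end may be joined more often than the segment's copy number. Ends left unjoined are treated as chromosome ends, and the sketch prefers the assignment with the fewest of them. The function names, the objective, and the brute-force search (standing in for an integer linear programming solver) are all illustrative assumptions.

```python
from itertools import product


def extremity_deficits(copy_number, multiplicity):
    """Return, for every segment end, how many of its copies are joined to nothing."""
    deficit = {}
    for seg, cn in copy_number.items():
        deficit[(seg, "t")] = cn  # tail end of the segment
        deficit[(seg, "h")] = cn  # head end of the segment
    for (u, v), m in multiplicity.items():
        # Each use of an adjacency consumes one copy of both ends it joins.
        deficit[u] -= m
        deficit[v] -= m
    return deficit


def best_multiplicities(copy_number, adjacencies):
    """Exhaustively search adjacency multiplicities (toy stand-in for an ILP solver).

    Assumed objective: minimize the number of unjoined segment ends, subject to
    no end being used more often than its segment's copy number.
    """
    max_cn = max(copy_number.values())
    best = None
    for combo in product(range(max_cn + 1), repeat=len(adjacencies)):
        multiplicity = dict(zip(adjacencies, combo))
        deficit = extremity_deficits(copy_number, multiplicity)
        if min(deficit.values()) < 0:
            continue  # infeasible: some end is used more than its copy number
        total = sum(deficit.values())
        if best is None or total < best[0]:
            best = (total, multiplicity)
    return best


# Hypothetical input: segments A, B, C where B is tandemly duplicated (A B B C).
copy_number = {"A": 1, "B": 2, "C": 1}
adjacencies = [
    (("A", "h"), ("B", "t")),  # reference adjacency A-B
    (("B", "h"), ("C", "t")),  # reference adjacency B-C
    (("B", "h"), ("B", "t")),  # breakpoint joining the end of B to its start
]
total_unjoined, multiplicity = best_multiplicities(copy_number, adjacencies)
```

In this toy input every adjacency is used exactly once and two ends remain unjoined (the tail of A and the head of C), which is consistent with a single linear chromosome A B B C. A realistic instance would need a proper solver, noise-tolerant constraints, and a final step that decomposes the weighted graph into chromosomes.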
By using a simulation model, we were able to estimate the correctness and robustness of the algorithm in a spectrum of scenarios. Under our base scenario, designed according to observations in real data, the algorithm correctly inferred 69% of the karyotypes. However, when using less stringent correctness metrics that account for incomplete and noisy data, 87% of the reconstructed karyotypes were correct. Furthermore, in scenarios where the data were very clean and complete, accuracy rose to 90%-100%. Some examples of analysis of real data, and the reconstructed karyotypes suggested by our algorithm, are also presented.
While reconstruction of a complete, perfect karyotype from short-read data is very hard, a large fraction of the reconstruction will still be correct and can provide useful information.