Bashir Ali, Volik Stanislav, Collins Colin, Bafna Vineet, Raphael Benjamin J
Bioinformatics Graduate Program, University of California San Diego, San Diego, California, United States of America.
PLoS Comput Biol. 2008 Apr 25;4(4):e1000051. doi: 10.1371/journal.pcbi.1000051.
Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy-neutral rearrangements, such as inversions and translocations, which are common in cancer and can produce novel fusion genes. Using both theoretical models and simulation, we address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely. We derive a formula for the probability that a fusion gene exists in a cancer genome given a collection of paired-end sequences from this genome. We use this formula to compute fusion gene probabilities in several breast cancer samples, and we find that we can accurately predict fusion genes in these samples with a relatively small number of fragments of large size. We further demonstrate how the ability to detect fusion genes depends on the distribution of gene lengths, and we evaluate how different parameters of a sequencing strategy affect breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by next-generation sequencing technologies.
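To give a feel for the "how much sequencing is required" question, the following is a minimal sketch of a standard Lander-Waterman-style coverage estimate. It is not the formula derived in the paper. It assumes fragments are placed uniformly at random on the genome, ignores read length and mapping errors, and treats a breakpoint as detected when at least one fragment spans it. The function names and the example parameter values are hypothetical and chosen only for illustration.

```python
import math


def detection_probability(num_fragments, fragment_length, genome_length):
    """Approximate probability that at least one fragment spans a fixed breakpoint.

    Assumption (not from the paper): each fragment covers a given position
    with probability fragment_length / genome_length, so the number of
    fragments spanning the breakpoint is approximately Poisson with mean
    num_fragments * fragment_length / genome_length (the clone coverage).
    """
    clone_coverage = num_fragments * fragment_length / genome_length
    return 1.0 - math.exp(-clone_coverage)


def fragments_needed(target_probability, fragment_length, genome_length):
    """Smallest number of fragments reaching the target detection probability."""
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    required_coverage = -math.log(1.0 - target_probability)
    return math.ceil(required_coverage * genome_length / fragment_length)


if __name__ == "__main__":
    genome_length = 3_000_000_000  # illustrative human-scale genome size (bp)
    for fragment_length in (2_000, 40_000, 150_000):  # illustrative insert sizes (bp)
        n = fragments_needed(0.95, fragment_length, genome_length)
        p = detection_probability(n, fragment_length, genome_length)
        print(f"L={fragment_length:>7} bp: {n:>9} fragments, P(detect)={p:.3f}")
```

Under these assumptions, the number of fragments needed for a fixed detection probability falls in inverse proportion to fragment length, which is consistent with the abstract's remark about a relatively small number of large fragments. Precise breakpoint localization is a separate question that this sketch does not model.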