Read clouds uncover variation in complex regions of the human genome.

Authors

Bishara Alex, Liu Yuling, Weng Ziming, Kashef-Haghighi Dorna, Newburger Daniel E, West Robert, Sidow Arend, Batzoglou Serafim

Affiliations

Department of Computer Science, Stanford University, Stanford, California 94305, USA;

Department of Computer Science, Stanford University, Stanford, California 94305, USA; Department of Chemistry, Stanford University, Stanford, California 94305, USA;

Publication information

Genome Res. 2015 Oct;25(10):1570-80. doi: 10.1101/gr.191189.115. Epub 2015 Aug 18.

DOI:10.1101/gr.191189.115

PMID:26286554

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4579342/

Abstract

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.
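
To make the read-cloud idea concrete, here is a minimal Python sketch of how shared barcode context can disambiguate a multi-mapped short read. It is not RFA and uses no Markov Random Field; it is a simple greedy heuristic written for illustration only. The function name `resolve_with_cloud`, the tuple layout of `reads`, the single-chromosome coordinates, and the 10 kb `max_fragment_len` default are all assumptions that do not come from the paper.

```python
from collections import defaultdict


def resolve_with_cloud(reads, max_fragment_len=10_000):
    """Pick one position per read using read-cloud (barcode) context.

    reads: list of (read_id, barcode, candidate_positions) tuples, where
    candidate_positions are integer coordinates on one chromosome.
    Returns {read_id: chosen position, or None if still ambiguous}.
    Hypothetical helper for illustration; not the RFA algorithm.
    """
    # Uniquely mapped reads serve as anchors for their read cloud.
    anchors = defaultdict(list)
    for _, barcode, candidates in reads:
        if len(candidates) == 1:
            anchors[barcode].append(candidates[0])

    placed = {}
    for read_id, barcode, candidates in reads:
        if len(candidates) == 1:
            placed[read_id] = candidates[0]
            continue
        # Keep candidates within one fragment length of an anchor
        # that carries the same barcode.
        supported = [
            pos for pos in candidates
            if any(abs(pos - a) <= max_fragment_len for a in anchors[barcode])
        ]
        # Commit only when exactly one candidate has cloud support.
        placed[read_id] = supported[0] if len(supported) == 1 else None
    return placed


reads = [
    ("r1", "bc1", [100_000]),              # unique anchor for cloud bc1
    ("r2", "bc1", [103_500, 5_200_000]),   # repeat read, same cloud as r1
    ("r3", "bc2", [5_201_000]),            # unique anchor for cloud bc2
    ("r4", "bc2", [103_600, 5_199_000]),   # repeat read, same cloud as r3
    ("r5", "bc3", [103_700, 5_200_500]),   # repeat read with no anchor
]
print(resolve_with_cloud(reads))
```

In this toy input, `r2` and `r4` hit the same two repeat copies but resolve to different copies because their clouds are anchored in different places, while `r5` stays unplaced because its cloud has no uniquely mapped read. A probabilistic model such as the one the abstract describes would instead score all reads in a cloud jointly rather than relying on a hard distance cutoff.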

