1 Cystic Fibrosis/Pulmonary Research and Treatment Center, and.
Am J Respir Cell Mol Biol. 2014 Jan;50(1):223-32. doi: 10.1165/rcmb.2013-0235OC.
Despite modern sequencing efforts, the difficulty of assembling highly repetitive sequences has prevented the resolution of gaps in the human genome, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large secreted mucin that is one of the two major secreted mucins in human airways. In the human genome reference (hg19), the MUC5AC region contains a gap across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the utility of SMRT sequencing long reads for de novo assembly of large repetitive sequences to fill gaps in the human genome. Characterization of the MUC5AC gene and of the sequence variation in its central exon will facilitate genetic and functional studies of this critical airway mucin.