Department of Genetics and Genomic Science and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Bioinformatics. 2014 Dec 15;30(24):3491-8. doi: 10.1093/bioinformatics/btu437. Epub 2014 Jul 15.
Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders such as Huntington's disease. Accurately resolving repeats, and TRs in particular, remains a challenging task in genome alignment, assembly and variant calling. Although tools have been developed for detecting microsatellites in short-read sequencing data, they are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may resolve a broader spectrum of TRs because of their longer reads, but their significantly higher raw error rates make it difficult to identify TRs and estimate their copy number accurately, and therefore call for new approaches.
Here we present PacmonSTR, a reference-based probabilistic approach for identifying the TR region and estimating the number of TR elements in long DNA reads. The method is a multistep procedure that takes as input a reference region and the reference TR element. First, the TR region is identified in the long DNA reads with a three-stage modified Smith-Waterman approach. Next, the expected number of TR elements is calculated with a method based on pair hidden Markov models (pair-HMMs). Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criterion and coverage expectations.
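To illustrate the alignment idea behind the first step, the following is a minimal sketch of plain Smith-Waterman local alignment used to locate a reference flanking sequence inside a read. It is not the three-stage modified variant used by PacmonSTR, whose details are not given here; the function name, the scoring parameters (`match`, `mismatch`, `gap`) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Plain Smith-Waterman with a linear gap penalty (illustrative scoring).

    Returns (best_score, start, end) so that target[start:end] is the
    stretch of the target covered by the best local alignment.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning query[i-1] with target[j-1].
        return match if query[i - 1] == target[j - 1] else mismatch

    # Fill the dynamic-programming matrix; local alignment floors cells at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in the target
                score[i][j - 1] + gap,            # gap in the query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# Hypothetical example: locate a flank that precedes a CAG repeat in a read.
flank = "ACGTTGCA"
read = "GGACGTTGCACAGCAGCAGTT"
score, start, end = smith_waterman(flank, read)
print(score, read[start:end])
```

Once both flanks of a repeat have been located in a read this way, the sequence between them is the candidate TR region passed to the next step.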
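The genotype-selection step can be sketched as a comparison between a one-component and a two-component Gaussian mixture fitted to per-read repeat-count estimates, choosing the model with the lower Akaike information criterion (AIC). This is a simplified illustration under stated assumptions, not the PacmonSTR implementation: it omits the coverage expectations mentioned above, and the function names, the `min_sd` floor on component standard deviations, the EM initialization and the example data are all hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_gmm(values, k, iterations=200, min_sd=0.5):
    """Fit a k-component 1-D Gaussian mixture by EM.

    min_sd is an assumed floor that prevents zero-variance components.
    Returns (log_likelihood, means).
    """
    n = len(values)
    ordered = sorted(values)
    # Initialize means at evenly spaced quantiles of the data.
    means = [ordered[int((c + 0.5) * n / k)] for c in range(k)]
    overall_mean = sum(values) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in values) / n)
    sds = [max(overall_sd, min_sd)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc < 1e-9:
                continue  # leave an empty component unchanged
            weights[c] = nc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, values)) / nc
            var = sum(r[c] * (x - means[c]) ** 2
                      for r, x in zip(resp, values)) / nc
            sds[c] = max(math.sqrt(var), min_sd)

    log_lik = sum(
        math.log(max(sum(w * normal_pdf(x, m, s)
                         for w, m, s in zip(weights, means, sds)), 1e-300))
        for x in values
    )
    return log_lik, means


def aic(log_lik, k):
    # Free parameters: k means, k standard deviations, k - 1 weights.
    return 2 * (3 * k - 1) - 2 * log_lik


def select_genotype(values):
    """Return (label, sorted component means) for the lower-AIC model."""
    fits = {k: fit_gmm(values, k) for k in (1, 2)}
    best_k = min(fits, key=lambda k: aic(fits[k][0], k))
    label = "homozygous" if best_k == 1 else "heterozygous"
    return label, sorted(fits[best_k][1])


# Hypothetical per-read repeat-count estimates for two loci.
homozygous_reads = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
heterozygous_reads = [9.8, 10.1, 10.0, 9.9, 10.2, 19.8, 20.1, 20.0, 19.9, 20.2]
print(select_genotype(homozygous_reads))
print(select_genotype(heterozygous_reads))
```

The AIC penalty of 2 per free parameter is what keeps the two-component model from being chosen when the reads form a single tight cluster, since the extra component then adds parameters without a matching gain in likelihood.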