Abante Jordi, Ghaffari Noushin, Johnson Charles D, Datta Aniruddha
Whitaker Biomedical Engineering Institute, Johns Hopkins University, 3400 N Charles St, Baltimore, MD, USA.
Center for Bioinformatics and Genomic Systems Engineering (CBGSE), 101 Gateway Blvd., College Station, TX, USA.
BMC Genomics. 2017 Sep 5;18(1):694. doi: 10.1186/s12864-017-3965-2.
The information content of genomes plays a crucial role in the existence and proper development of living organisms. Thus, tremendous effort has been dedicated to developing DNA sequencing technologies that provide a better understanding of the underlying mechanisms of cellular processes. Advances in sequencing technology have made it possible to sequence genomes relatively quickly and inexpensively. However, as with any measurement technology, sequencing is noisy, and this noise must be addressed before conclusions can be drawn from the resulting data. In addition, constructing a genome assembly involves multiple intermediate steps and degrees of freedom, which leads to ambiguous and inconsistent results among assemblers.
Here we introduce HiMMe, an HMM-based tool that relies on genetic patterns to score genome assemblies. Through a Markov chain, the model is able to detect characteristic genetic patterns, while, by introducing emission probabilities, the noise involved in the process is taken into account. Prior knowledge can be used by training the model to fit a given organism or sequencing technology.
Our results show that the method presented is able to recognize patterns even with relatively small k-mer sizes and limited computational resources.
Our methodology provides an individual quality metric per contig in addition to an overall genome assembly score, with a time complexity well below that of an aligner. Ultimately, HiMMe provides meaningful statistical insights that can be leveraged by researchers to better select contigs and genome assemblies for downstream analysis.
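The scoring idea described above can be illustrated with a minimal sketch. This is not HiMMe's code: the function names (`train_transitions`, `score_contig`), the uniform substitution-noise model, the `error_rate` and `pseudocount` parameters, and the per-k-mer normalization are all assumptions made for illustration. Hidden states are the "true" k-mers, a Markov chain over overlapping k-mers is estimated from a reference sequence, emission probabilities model sequencing noise, and the forward algorithm yields a log-likelihood score per contig.

```python
import itertools
import math

BASES = "ACGT"


def kmers_of(seq, k):
    """Return the overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train_transitions(reference, k, pseudocount=1.0):
    """Estimate log P(next k-mer | current k-mer) from a reference.

    Assumes the reference contains only A, C, G and T. The pseudocount
    keeps every allowed transition at a non-zero probability.
    """
    states = ["".join(p) for p in itertools.product(BASES, repeat=k)]
    counts = {s: {b: pseudocount for b in BASES} for s in states}
    ref_kmers = kmers_of(reference, k)
    for cur, nxt in zip(ref_kmers, ref_kmers[1:]):
        counts[cur][nxt[-1]] += 1
    log_trans = {}
    for s in states:
        total = sum(counts[s].values())
        # A k-mer can only be followed by a k-mer that overlaps it by k-1 bases.
        log_trans[s] = {s[1:] + b: math.log(c / total)
                        for b, c in counts[s].items()}
    return states, log_trans


def log_emission(observed, state, error_rate):
    """Log P(observed k-mer | true k-mer) under uniform substitution noise."""
    logp = 0.0
    for o, s in zip(observed, state):
        logp += math.log(1 - error_rate) if o == s else math.log(error_rate / 3)
    return logp


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score_contig(contig, states, log_trans, k, error_rate=0.01):
    """Forward-algorithm log-likelihood of a contig, averaged per k-mer."""
    obs = kmers_of(contig, k)
    if not obs:
        raise ValueError("contig is shorter than k")
    log_init = -math.log(len(states))  # uniform initial distribution (assumed)
    alpha = {s: log_init + log_emission(obs[0], s, error_rate) for s in states}
    for o in obs[1:]:
        incoming = {s: [] for s in states}
        for prev, a in alpha.items():
            for nxt, lt in log_trans[prev].items():
                incoming[nxt].append(a + lt)
        alpha = {s: logsumexp(incoming[s]) + log_emission(o, s, error_rate)
                 for s in states}
    # Dividing by the number of k-mers makes contigs of different lengths comparable.
    return logsumexp(list(alpha.values())) / len(obs)


if __name__ == "__main__":
    k = 3
    states, log_trans = train_transitions("ACGT" * 50, k)
    contigs = {"contig_1": "ACGTACGTACGT", "contig_2": "AAAAAAAAAAAA"}
    for name, seq in contigs.items():
        print(name, score_contig(seq, states, log_trans, k))
```

In this sketch a contig whose k-mer pattern resembles the training sequence receives a higher (less negative) score than one that does not. Each forward step visits every state and its four allowed successors, so the cost grows linearly with contig length and with the number of k-mer states (4^k), which is why small k keeps the computation cheap.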