Institute of Mathematics, University of Innsbruck, Technikerstrasse 13, 6020 Innsbruck, Austria.
Forensic Sci Int Genet. 2011 Mar;5(2):126-32. doi: 10.1016/j.fsigen.2010.10.006. Epub 2010 Nov 5.
The analysis of the haploid mitochondrial (mt) genome has numerous applications in forensic and population genetics, as well as in disease studies. Although mtDNA haplotypes are usually determined by sequencing, they are rarely reported as a nucleotide string. Traditionally, they are presented in a difference-coded, position-based format relative to the corrected version of the first sequenced mtDNA. This convention depends on recommendations for standardized sequence alignment, yet alignment practice is known to vary between scientific disciplines and even between laboratories. As a consequence, database searches that are vital for the interpretation of mtDNA data can give biased results when query and database haplotypes are annotated differently. In the forensic context, this would usually lead to underestimation of the absolute and relative frequencies. To address this issue we introduce SAM, a string-based search algorithm that converts query and database sequences to position-free nucleotide strings and thus eliminates the possibility that identical sequences will be missed in a database query. The mere application of a BLAST algorithm would not be a sufficient remedy, as it uses a heuristic approach and does not address properties specific to mtDNA, such as insertion and deletion events that may be phylogenetically stable or rapidly evolving. The software presented here provides additional flexibility to incorporate phylogenetic data, site-specific mutation rates, and other biologically relevant information that would refine the interpretation of mitochondrial DNA data. The manuscript is accompanied by freeware and example data sets that can be used to evaluate the new software (http://stringvalidation.org).
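The idea of searching on position-free strings rather than on difference-coded annotations can be illustrated with a minimal Python sketch. This is not the SAM implementation: the reference below is a ten-base toy sequence (not the real mtDNA reference), the notation is a simplified form of the usual difference coding, and the function names `to_string`, `count_position_based`, and `count_string_based` are hypothetical.

```python
import re

# Toy reference sequence (an assumption for illustration, NOT the real
# mtDNA reference). Positions are 1-based; positions 2-6 form a C-run.
REFERENCE = "ACCCCCTAGG"

# Simplified difference notation assumed here:
#   "7G"    substitution: base G observed at position 7
#   "6DEL"  deletion of position 6
#   "6.1C"  insertion of C after position 6 ("6.2C" would be a second one)
_DIFF = re.compile(r"(\d+)(?:\.(\d+))?([ACGT]|DEL)")


def to_string(differences, reference=REFERENCE):
    """Convert a difference-coded haplotype to a position-free string."""
    subs, dels, ins = {}, set(), {}
    for diff in differences:
        m = _DIFF.fullmatch(diff)
        if m is None:
            raise ValueError(f"unrecognized difference: {diff!r}")
        pos, idx, value = int(m.group(1)), m.group(2), m.group(3)
        if not 1 <= pos <= len(reference):
            raise ValueError(f"position out of range: {diff!r}")
        if idx is not None:
            if value == "DEL":
                raise ValueError(f"cannot delete an insertion: {diff!r}")
            ins.setdefault(pos, []).append((int(idx), value))
        elif value == "DEL":
            dels.add(pos)
        else:
            subs[pos] = value

    out = []
    for pos, base in enumerate(reference, start=1):
        if pos not in dels:
            out.append(subs.get(pos, base))
        # Inserted bases follow the reference position they are attached to.
        out.extend(b for _, b in sorted(ins.get(pos, [])))
    return "".join(out)


def count_position_based(query, database):
    """Count database entries whose annotation equals the query annotation."""
    return sum(set(h) == set(query) for h in database)


def count_string_based(query, database):
    """Count database entries whose nucleotide string equals the query's."""
    q = to_string(query)
    return sum(to_string(h) == q for h in database)


# Two laboratories annotate the same extra C in the C-run differently.
database = [["6.1C"], ["2.1C"], ["7G"]]
query = ["6.1C"]
print(count_position_based(query, database))  # misses the "2.1C" entry
print(count_string_based(query, database))    # finds both identical sequences
```

In this toy case the annotations `6.1C` and `2.1C` describe the same nucleotide string, so comparing annotations undercounts the match while comparing strings does not. The sketch covers exact matches only and does not model the phylogenetic or mutation-rate information mentioned in the abstract.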