Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA; Department of Microbiology, Immunology and Genetics, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA.
Center for Human Identification, University of North Texas Health Science Center, 3500 Camp Bowie Blvd., Fort Worth, TX 76107, USA.
Forensic Sci Int Genet. 2021 Mar;51:102459. doi: 10.1016/j.fsigen.2020.102459. Epub 2020 Dec 25.
Unique molecular identifiers (UMIs) are a promising approach to contend with errors generated during PCR and massively parallel sequencing (MPS). With UMI technology, random molecular barcodes are ligated to template DNA molecules prior to PCR, allowing PCR and sequencing errors to be tracked and corrected bioinformatically. UMIs have the potential to be particularly informative for the interpretation of short tandem repeats (STRs). Traditional MPS approaches may only yield alleles that are consistent with a hypothesis of stutter, whereas with UMIs stutter products may be bioinformatically re-associated with their parental alleles and subsequently removed. Herein, a bioinformatics pipeline named strumi is described that is designed for the analysis of STRs that are tagged with UMIs. Unlike other tools, strumi is an alignment-free, machine-learning-driven algorithm that clusters individual MPS reads into UMI families, infers consensus super-reads that represent each family, and provides an estimate of the resulting haplotype's accuracy. Super-reads, in turn, approximate independent measurements not of the PCR products but of the original template molecules, in terms of both quantity and sequence identity. Provisional assessments show that naïve threshold-based approaches generate super-reads that are accurate (∼97 % haplotype accuracy, compared to ∼78 % when UMIs are not used), and that applying a more nuanced machine learning approach increases the accuracy to ∼99.5 %, depending on the level of certainty desired. With these features, UMIs may greatly simplify probabilistic genotyping systems and reduce uncertainty. However, the ability to interpret alleles at trace levels also permits the interpretation, characterization and quantification of contamination as well as somatic variation (including somatic stutter), which may present newfound challenges.
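The naïve threshold-based idea described above can be illustrated with a minimal Python sketch. It is not the strumi implementation: the function name `build_super_reads`, the parameters `min_family_size` and `min_agreement`, and their default values are hypothetical, and the sketch assumes reads arrive as (UMI, sequence) pairs, that UMIs are matched exactly (sequencing errors within the UMI itself are ignored), and that the consensus is a simple majority vote over whole STR sequences with no alignment.

```python
from collections import Counter, defaultdict


def build_super_reads(reads, min_family_size=3, min_agreement=0.6):
    """Group (umi, sequence) pairs into UMI families and call one
    majority-vote consensus ("super-read") per family.

    Hypothetical illustration only: the thresholds are arbitrary and
    UMIs are compared by exact string match.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    super_reads = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to support a consensus
        top_seq, top_count = Counter(seqs).most_common(1)[0]
        agreement = top_count / len(seqs)
        if agreement >= min_agreement:
            # Keep the consensus, the family size, and the fraction of
            # reads in the family that support the consensus.
            super_reads[umi] = (top_seq, len(seqs), agreement)
    return super_reads


# Toy data: one family contains a single read that is one repeat short
# (a stutter-like product), and one UMI is seen only once.
allele_10 = "AGAT" * 10
allele_11 = "AGAT" * 11
stutter_9 = "AGAT" * 9

reads = (
    [("ACGT", allele_10)] * 4
    + [("ACGT", stutter_9)]
    + [("TTGA", allele_11)] * 3
    + [("GGCC", allele_10)]
)

result = build_super_reads(reads)
for umi, (seq, size, agreement) in sorted(result.items()):
    print(umi, len(seq) // 4, "repeats", size, "reads", agreement)
```

In this toy input the stutter-like read in family `ACGT` is outvoted by its four sibling reads, so the family yields a single 10-repeat super-read, while the singleton UMI `GGCC` is discarded by the family-size threshold. The per-family agreement fraction is only a crude stand-in for the accuracy estimate that the abstract attributes to the machine learning approach.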