Friis Susanne L, Buchard Anders, Rockenbauer Eszter, Børsting Claus, Morling Niels
Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark.
Forensic Sci Int Genet. 2016 Mar;21:68-75. doi: 10.1016/j.fsigen.2015.12.006. Epub 2015 Dec 12.
This work introduces STRinNGS, an in-house developed Python application for analysing STR sequence elements in BAM or FASTQ files. STRinNGS identifies sequence reads containing STR loci by their flanking sequences, analyses the STR sequence and the flanking regions, and generates a report with the assigned SNP-STR alleles. The main output file from STRinNGS contains all sequences with read counts above 1% of the total number of reads per locus. STR sequences are automatically named according to the nomenclature used previously and the repeat unit definitions in STRBase (http://www.cstl.nist.gov/strbase/). Each sequence name comprises (1) the locus name, (2) the length of the repeat region divided by the length of the repeat unit, (3) the sequence(s) of the repeat unit(s), each followed by its number of repeats, and (4) variations in the flanking regions. Lower-case letters in the main output file flag sequences with previously unknown variations in the STRs. SNPs in the flanking regions are named by their "rs" numbers and the nucleotides at the SNP position. Data from 207 Danes, sequenced with the Ion Torrent™ HID STR 10-plex that amplifies nine STRs (CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D16S539, TH01, TPOX, vWA) and Amelogenin, were analysed with STRinNGS. Sequencing uncovered five common SNPs near four STRs and revealed 20 new alleles in the 207 Danes. Three short homopolymers in the D8S1179 flanking regions caused frequent sequencing errors. In 29 of 3726 allele calls (0.8%), sequences with homopolymer errors were falsely assigned as true alleles. An in-house developed R script compensated for these errors by pooling sequence reads that had identical STR sequences and identical nucleotides at the five common SNPs. In the output file from the R script, all SNP-STR haplotype calls were correct.
The 207 samples and six additional samples were sequenced for D3S1358, D12S391, and D21S11 using the 454 GS Junior platform in this and a previous study. Overall, next generation sequencing (NGS) of the 11 STRs lowered the mean match probability by a factor of 386 and increased the typical paternity indices (i.e. the geometric means) for trios and duos by factors of 47 and 23, respectively, compared to traditional PCR-CE typing of the same population.
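The read selection, the 1% read-count threshold, and naming elements (1) to (3) described in the abstract can be illustrated with a minimal Python sketch. It is not the STRinNGS code. The locus name `TOY1`, both flanking sequences, the function names, and the bracketed output format are hypothetical and chosen only for illustration. Flanking-region variation, "rs"-based SNP naming, and lower-case flagging of unknown variants are not covered.

```python
from collections import Counter

# Hypothetical toy locus; these values are assumptions, not taken from STRinNGS.
LOCUS = "TOY1"
LEFT_FLANK = "ACGTTGCA"
RIGHT_FLANK = "GGATCCTA"
UNIT_LEN = 4  # length of the repeat unit in nucleotides


def extract_repeat_region(read, left=LEFT_FLANK, right=RIGHT_FLANK):
    """Return the sequence between the two flanks, or None if a flank is missing."""
    start = read.find(left)
    if start == -1:
        return None
    start += len(left)
    end = read.find(right, start)
    if end == -1:
        return None
    return read[start:end]


def filter_by_read_fraction(regions, min_fraction=0.01):
    """Keep sequences whose read count is above min_fraction of all reads at the locus."""
    counts = Counter(regions)
    total = sum(counts.values())
    return {seq: n for seq, n in counts.items() if n / total > min_fraction}


def name_sequence(locus, region, unit_len=UNIT_LEN):
    """Build a name from the locus, the length-based allele number and the repeat blocks.

    Simplification: the region is cut into unit-sized chunks from its first base,
    so alleles with partial repeats inside the region are not split correctly.
    """
    whole, rest = divmod(len(region), unit_len)
    allele = f"{whole}.{rest}" if rest else str(whole)
    blocks = []  # list of [unit, consecutive count]
    for i in range(0, len(region), unit_len):
        unit = region[i:i + unit_len]
        if blocks and blocks[-1][0] == unit:
            blocks[-1][1] += 1
        else:
            blocks.append([unit, 1])
    structure = "".join(f"[{unit}]{count}" for unit, count in blocks)
    return f"{locus}[{allele}]{structure}"


def call_sequences(reads):
    """Map each reported sequence name to its read count."""
    regions = [r for r in map(extract_repeat_region, reads) if r is not None]
    kept = filter_by_read_fraction(regions)
    return {name_sequence(LOCUS, seq): n for seq, n in kept.items()}


if __name__ == "__main__":
    def make_read(region):
        return "TT" + LEFT_FLANK + region + RIGHT_FLANK + "CC"

    # Synthetic reads: two alleles plus a low-level noise sequence below 1%.
    reads = (
        [make_read("AATG" * 6)] * 150
        + [make_read("AATG" * 7)] * 148
        + [make_read("AATG" * 5)] * 2
        + ["NNNNNNNN"]  # read without flanks, ignored
    )
    for name, count in sorted(call_sequences(reads).items()):
        print(name, count)
```

In this sketch, the noise sequence with 2 of 300 reads falls below the threshold and is dropped, while the two remaining sequences are reported with their counts. A real implementation would also have to handle reverse-complemented reads and sequencing errors in the flanks, both of which are ignored here.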