Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061 and Office of Biostatistics Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Bioinformatics. 2014 Mar 1;30(5):652-9. doi: 10.1093/bioinformatics/btt595. Epub 2013 Oct 17.
Inferring lengths of inherited microsatellite alleles with single base pair resolution from short sequence reads is challenging due to several sources of noise caused by the repetitive nature of microsatellites and the technologies used to generate raw sequence data.
We have developed a program, GenoTan, that combines a discretized Gaussian mixture model with a rules-based approach to identify inherited variation at microsatellite loci from short sequence reads without paired-end information. It distinguishes length variants from noise, including insertion/deletion errors in homopolymer runs, by accounting for the bidirectional nature of insertion and deletion errors in sequence reads. Here we introduce a homopolymer decomposition method that estimates error bias toward insertion or deletion in homopolymer runs. Combining these approaches, GenoTan correctly genotyped 94.9% of microsatellite loci in simulated data with 40× sequence coverage, whereas the other programs made <90% correct calls on the same data and required 5 to 30 times more computational time than GenoTan. It also showed the highest true-positive rate on real data, using mixed sequence data from two inbred Drosophila lines, a novel validation approach for genotyping.
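To make the idea of a discretized Gaussian mixture concrete, the following is a minimal sketch and not GenoTan's actual implementation. It assumes each read reports an integer repeat-tract length, that each allele emits lengths from a Gaussian integrated over unit-width bins, and that a diploid genotype is an equal-weight mixture of two such components. The function names, the fixed `sigma`, and the exhaustive search over observed lengths are all illustrative assumptions; the rules-based filtering and homopolymer decomposition described above are not modeled here.

```python
import math
from itertools import combinations_with_replacement


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian with mean mu and s.d. sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretized_gaussian_pmf(k, mu, sigma):
    """Probability of observing integer length k: Gaussian mass on [k - 0.5, k + 0.5]."""
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)


def genotype_log_likelihood(lengths, genotype, sigma):
    """Log-likelihood of observed lengths under an equal-weight two-allele mixture."""
    a, b = genotype
    total = 0.0
    for k in lengths:
        p = 0.5 * discretized_gaussian_pmf(k, a, sigma) \
            + 0.5 * discretized_gaussian_pmf(k, b, sigma)
        total += math.log(max(p, 1e-300))  # guard against log(0) for distant lengths
    return total


def call_genotype(lengths, sigma=0.5):
    """Return the allele-length pair (a <= b) with the highest log-likelihood.

    Candidate alleles are restricted to observed lengths (an assumption made
    to keep the search small); sigma is a hypothetical fixed noise level.
    """
    candidates = sorted(set(lengths))
    return max(
        combinations_with_replacement(candidates, 2),
        key=lambda g: genotype_log_likelihood(lengths, g, sigma),
    )


if __name__ == "__main__":
    # Two well-supported lengths plus one stray read: expect a heterozygous call.
    print(call_genotype([20] * 10 + [24] * 9 + [21]))
    # One dominant length with a few off-by-one reads: expect a homozygous call.
    print(call_genotype([15] * 18 + [16] * 2))
```

In this toy model, a handful of off-by-one reads is absorbed by the Gaussian tail of a single allele rather than being called as a second allele, which is the kind of noise-versus-variant distinction the abstract describes.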
GenoTan is open-source software available at http://genotan.sourceforge.net.