Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics. 2011 Apr 11;12:93. doi: 10.1186/1471-2105-12-93.
Methods for determining whether a particular HIV-1 sequence stems, completely or in part, from an unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, only a single algorithm, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not predict the boundaries of unknown fragments.
We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model that automatically determines which parts of an input sequence originate from an as yet unknown subtype. The underlying model consists of a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated from the emission frequencies of the known subtypes by means of a position-wise probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences, and compared the results with the corresponding results of the BI.
Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences that stem from as yet unknown subtypes. A comparison with the BI shows that our algorithm performs as well as the BI or better.
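To make the modelling idea above more concrete, the following is a minimal illustrative sketch, not the USF implementation. It assumes the query is already aligned column by column to the subtype profiles (so insert and delete states are omitted), that all emission probabilities are non-zero, and that the "unknown" profile is obtained by blending the average of the known subtypes' emission frequencies with a uniform distribution. That blending rule, the `mix` and `stay` parameters, and all function names are assumptions made for illustration; the abstract does not specify the actual emergence model.

```python
import math

ALPHABET = "ACGT"


def unknown_profile(known, mix=0.5):
    """Build position-wise emissions for a hypothetical 'unknown' subtype.

    ASSUMPTION: average the known subtypes' frequencies per column and blend
    the result with a uniform distribution. This stands in for the paper's
    emergence model, which is not described in the abstract.
    """
    length = len(next(iter(known.values())))
    profile = []
    for pos in range(length):
        column = {}
        for base in ALPHABET:
            mean = sum(p[pos][base] for p in known.values()) / len(known)
            column[base] = (1 - mix) * mean + mix / len(ALPHABET)
        profile.append(column)
    return profile


def viterbi_labels(query, profiles, stay=0.95):
    """Return the most probable subtype label for each aligned position."""
    states = list(profiles)
    switch = (1 - stay) / (len(states) - 1)  # probability of jumping to one other state

    def trans(prev, cur):
        return math.log(stay if prev == cur else switch)

    # Uniform start distribution over all states.
    score = {s: -math.log(len(states)) + math.log(profiles[s][0][query[0]])
             for s in states}
    back = []
    for pos in range(1, len(query)):
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + trans(r, s))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + trans(best_prev, s)
                            + math.log(profiles[s][pos][query[pos]]))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def segments(path):
    """Collapse a per-position labelling into (start, end, label) runs."""
    out, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            out.append((start, i, path[start]))
            start = i
    return out


def column(major, p=0.85):
    """Toy emission column that favours one base (hypothetical data)."""
    return {b: p if b == major else (1 - p) / 3 for b in ALPHABET}


# Toy profiles: subtype "A" favours base A, subtype "B" favours base C.
known = {"A": [column("A")] * 20, "B": [column("C")] * 20}
profiles = dict(known, unknown=unknown_profile(known))
query = "A" * 10 + "G" * 10  # second half matches neither known subtype
result = segments(viterbi_labels(query, profiles))
print(result)
```

On this toy input, the first half of the query is assigned to subtype "A" and the second half, which matches neither known profile, is assigned to the "unknown" state. The segment boundaries fall directly out of the decoded state path, which is the kind of output the BI is described above as lacking.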