Beggel Bastian, Neumann-Fraune Maria, Kaiser Rolf, Verheyen Jens, Lengauer Thomas
Department of Computational Biology and Applied Algorithms, Max Planck Institute for Informatics, Saarbrücken, Germany.
Institute of Virology, University of Cologne, Cologne, Germany.
PLoS One. 2013 Dec 20;8(12):e81687. doi: 10.1371/journal.pone.0081687. eCollection 2013.
Direct Sanger sequencing of viral genome populations yields multiple ambiguous sequence positions. Linkage information cannot be read off sequencing chromatograms in a straightforward way, which hampers the correct interpretation of the sequence data. We present a method that determines which variants exist in a viral quasispecies when there are two nearby ambiguous sequence positions, by exploiting the sequence context-dependent incorporation of dideoxynucleotides. The computational model was trained on sequencing chromatograms of clonal variants and evaluated on two test sets of in vitro mixtures. It identified the mixture components with an accuracy of 97.4% on a test set in which the analyzed positions are only one base apart, and of 84.5% on a test set in which the ambiguous positions are separated by three bases. In silico experiments suggest two major limitations of the approach in terms of accuracy. First, due to a basic limitation of Sanger sequencing, minor variants with a relative frequency of 10% or less cannot be detected reliably. Second, the model cannot distinguish between mixtures of two and four clonal variants if one of two sets of linear constraints is fulfilled. Furthermore, the approach requires repeated sequencing of all variants that might be present in the mixture to be analyzed. Nevertheless, the effectiveness of our method on the two in vitro test sets shows that short-range linkage information for two ambiguous sequence positions can be inferred from Sanger sequencing chromatograms without any further assumptions about the mixture composition. Additionally, our model provides new insights into the established and widely used Sanger sequencing technology. The source code of our method is available at http://bioinf.mpi-inf.mpg.de/publications/beggel/linkageinformation.zip.
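The linkage problem described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' model. It only shows why per-position base proportions alone cannot tell apart a mixture of two variants from a mixture of four. The base labels, the mixtures, and the helper names `haplotypes` and `marginals` are hypothetical and chosen purely for illustration.

```python
from itertools import product


def haplotypes(pos1_bases, pos2_bases):
    """All combinations of the bases seen at two ambiguous positions."""
    return [a + b for a, b in product(pos1_bases, pos2_bases)]


def marginals(mixture):
    """Per-position base frequencies implied by a haplotype mixture.

    `mixture` maps a two-letter haplotype (base at position 1, base at
    position 2) to its relative frequency in the population.
    """
    pos1, pos2 = {}, {}
    for hap, freq in mixture.items():
        pos1[hap[0]] = pos1.get(hap[0], 0.0) + freq
        pos2[hap[1]] = pos2.get(hap[1], 0.0) + freq
    return pos1, pos2


# Three hypothetical mixtures over the same two ambiguous positions
# (A/G at the first position, C/T at the second).
mix_two_a = {"AC": 0.5, "GT": 0.5}
mix_two_b = {"AT": 0.5, "GC": 0.5}
mix_four = {"AC": 0.25, "AT": 0.25, "GC": 0.25, "GT": 0.25}

if __name__ == "__main__":
    print("candidate variants:", haplotypes("AG", "CT"))
    for name, mix in [("two (a)", mix_two_a),
                      ("two (b)", mix_two_b),
                      ("four", mix_four)]:
        print(name, marginals(mix))
```

All three mixtures imply the same base proportions at each position, so the composition cannot be recovered from those proportions alone. According to the abstract, the method obtains the missing information from the context-dependent incorporation of dideoxynucleotides, which this sketch does not attempt to model.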