Huang Austin, Kantor Rami, DeLong Allison, Schreier Leeann, Istrail Sorin
Division of Infectious Disease, Computer Science Department, Brown University, Box 1910, Providence, RI 02912, USA.
In Silico Biol. 2011;11(5-6):193-201. doi: 10.3233/ISB-2012-0454.
Next-generation sequencing technologies have recently been applied to characterize mutational spectra of the heterogeneous population of viral genotypes (known as a quasispecies) within HIV-infected patients. Such information is clinically relevant because minority genetic subpopulations of HIV within patients enable viral escape from selection pressures such as the immune response and antiretroviral therapy. However, methods for quasispecies sequence reconstruction from next-generation sequencing reads are not yet widely used and remain an emerging area of research. Furthermore, most research methodology in HIV has focused on 454 sequencing, while many next-generation sequencing platforms used in practice are limited to shorter read lengths than 454 sequencing. Little work has been done to determine how best to address the read length limitations of these other platforms. The approach described here incorporates graph representations of both read differences and read overlap to conservatively determine the regions of the sequence with sufficient variability to separate quasispecies sequences. Within these tractable regions of quasispecies inference, we use constraint programming to solve for an optimal quasispecies subsequence determination via vertex coloring of the conflict graph, a representation which also lends itself to data with non-contiguous reads such as paired-end sequencing. We demonstrate the utility of the method by applying it to simulations based on actual intra-patient clonal HIV-1 sequencing data.
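To make the conflict-graph idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each read is stored as a mapping from reference position to base, so a paired-end read is simply a mapping with a gap. Two reads conflict when they disagree at any shared position, and a proper vertex coloring then assigns conflicting reads to different candidate quasispecies sequences. The abstract states that constraint programming is used for the coloring; here a plain backtracking search stands in for it, and the read-overlap graph used to select tractable regions is omitted. The names `make_read`, `build_conflict_graph`, `color_with`, and `min_coloring` are hypothetical.

```python
from itertools import combinations


def make_read(start, seq):
    """Hypothetical helper: store a read as {reference position: base}."""
    return {start + i: base for i, base in enumerate(seq)}


def conflicts(a, b):
    """Two reads conflict if they disagree at any position both cover."""
    return any(a[pos] != b[pos] for pos in a.keys() & b.keys())


def build_conflict_graph(reads):
    """Adjacency sets over read indices; an edge means 'cannot share a sequence'."""
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if conflicts(reads[i], reads[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def color_with(graph, k):
    """Backtracking search for a proper coloring with at most k colors.

    Returns {vertex: color} or None. This stands in for the constraint
    programming solver mentioned in the abstract (an assumption of this sketch).
    """
    order = sorted(graph, key=lambda v: -len(graph[v]))  # high degree first
    colors = {}

    def assign(idx):
        if idx == len(order):
            return True
        v = order[idx]
        used = {colors[u] for u in graph[v] if u in colors}
        # Symmetry breaking: open at most one previously unused color.
        limit = min(k, max(colors.values(), default=-1) + 2)
        for c in range(limit):
            if c not in used:
                colors[v] = c
                if assign(idx + 1):
                    return True
                del colors[v]
        return False

    return dict(colors) if assign(0) else None


def min_coloring(graph):
    """Smallest k admitting a proper coloring, together with one such coloring."""
    for k in range(1, len(graph) + 1):
        result = color_with(graph, k)
        if result is not None:
            return k, result
    return 0, {}


# Four short reads drawn from two variants that differ at position 2.
reads = [
    make_read(0, "ACGT"),
    make_read(2, "GTAA"),
    make_read(0, "ACCT"),
    make_read(2, "CTAA"),
]
graph = build_conflict_graph(reads)
num_colors, coloring = min_coloring(graph)
```

In this toy input the reads fall into two color classes, one per variant. Reads that share no positions never conflict, so a coloring alone cannot tell whether they belong together; this is one reason the method limits inference to regions with sufficient overlap and variability.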