Novitsky Vlad, Moyo Sikhulile, Lei Quanhong, DeGruttola Victor, Essex M
Harvard School of Public Health AIDS Initiative, Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, Massachusetts.
AIDS Res Hum Retroviruses. 2015 May;31(5):531-42. doi: 10.1089/AID.2014.0211. Epub 2015 Feb 6.
To improve the methodology of HIV cluster analysis, we examined how the extent of detected HIV clustering is associated with parameters that can affect the outcome of viral clustering. The extent of HIV clustering and tree certainty were compared between 401 HIV-1C near full-length genome sequences and subgenomic regions retrieved from the LANL HIV Database. Sliding window analysis was based on 99 windows of 1,000 bp and 45 windows of 2,000 bp. Potential associations between the extent of HIV clustering and sequence length, and between the extent of HIV clustering and the number of variable and informative sites, were evaluated. The near full-length genome HIV sequences showed the highest extent of HIV clustering and the highest tree certainty. At a bootstrap threshold of 0.80 in maximum likelihood (ML) analysis, 58.9% of near full-length HIV-1C sequences but only 15.5% of partial pol sequences (ViroSeq) were found in clusters. Among HIV-1 structural genes, pol showed the highest extent of clustering (38.9% at a bootstrap threshold of 0.80), although this was significantly lower than for the near full-length genome sequences. The extent of HIV clustering was significantly higher for sliding windows of 2,000 bp than for those of 1,000 bp. We found a strong association between sequence length and the proportion of HIV sequences in clusters, and a moderate association between the number of variable and informative sites and the proportion of HIV sequences in clusters. In HIV cluster analysis, the extent of detectable HIV clustering is directly associated with the length of the viral sequences used, as well as with the number of variable and informative sites. Near full-length genome sequences could provide the most informative HIV cluster analysis. Selected subgenomic regions with a high extent of HIV clustering and high tree certainty could also be considered as a second choice.
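As an illustration of the per-window quantities mentioned in the abstract, the following is a minimal Python sketch that counts variable and parsimony-informative sites in sliding windows over an alignment. It is not the authors' pipeline. The function names (`site_counts`, `sliding_windows`, `window_site_counts`), the step size, and the choice to treat `-`, `?`, and `N` as missing data are all assumptions made for this example; the abstract does not state the step size used. A site is counted as variable when at least two character states are observed, and as informative when at least two states each occur in at least two sequences.

```python
from collections import Counter

# Assumption: these characters are treated as missing data and ignored.
MISSING = set("-?N")


def site_counts(alignment):
    """Return (variable, informative) site counts for equal-length aligned sequences."""
    variable = 0
    informative = 0
    for column in zip(*alignment):
        counts = Counter(b.upper() for b in column if b.upper() not in MISSING)
        # Variable: at least two distinct states observed in the column.
        if len(counts) >= 2:
            variable += 1
        # Parsimony-informative: at least two states, each seen in >= 2 sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


def sliding_windows(length, size, step):
    """Yield (start, end) half-open coordinates of windows that fit fully in `length`."""
    for start in range(0, length - size + 1, step):
        yield start, start + size


def window_site_counts(alignment, size, step):
    """Return (start, end, variable, informative) for each sliding window."""
    length = len(alignment[0])
    results = []
    for start, end in sliding_windows(length, size, step):
        window = [seq[start:end] for seq in alignment]
        results.append((start, end, *site_counts(window)))
    return results


if __name__ == "__main__":
    # Toy alignment for demonstration only; not data from the study.
    toy = ["AAGT", "AAGT", "ACTT", "ACTC"]
    print(site_counts(toy))
    print(window_site_counts(toy, size=2, step=1))
```

With a real alignment, `size` would be set to 1,000 or 2,000 and `step` chosen by the analyst; the resulting counts could then be compared against the proportion of sequences in clusters obtained for each window.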