Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.
Department of Biological Sciences, Chicago State University, Chicago, United States of America.
Infect Genet Evol. 2020 Jan;77:104080. doi: 10.1016/j.meegid.2019.104080. Epub 2019 Nov 1.
HIV-1 is the most common and pathogenic strain of human immunodeficiency virus and comprises many subtypes. Identifying the subtype of clinical HIV-1 samples is important for studying how subtypes differ in infection, diagnosis and drug design. In this work, we propose an effective numeric representation, the Subsequence Natural Vector (SNV), to encode HIV-1 sequences, and we use it with an improved linear discriminant analysis method to classify HIV-1 viruses. SNV is based on the distribution of nucleotides in HIV-1 viral sequences: it records not only the number of each nucleotide but also the position and variance of the nucleotides. To validate this alignment-free method, we collected 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes from the up-to-date Los Alamos HIV database. SNV outperforms three popular methods, Kameris, Comet and REGA, achieving almost 100% sensitivity and specificity in much less time. The subtyping algorithm works especially well for circulating recombinant forms (CRFs) represented by only a few sequences, and it separates unique recombinant forms (URFs) from other subtypes with 100% sensitivity and specificity. Moreover, phylogenetic trees built from the SNV representation of full-length HIV-1 genomes and of pol genes cluster viruses of the same subtype together correctly.
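The abstract does not give the SNV formula, so the following Python sketch is only an illustration of the idea of encoding counts, positions and variance of nucleotides. It assumes the commonly used natural vector definition (for each nucleotide: its count, its mean position, and its second central moment of position normalized by the count times the sequence length) and assumes that the sequence is split into equal-length, consecutive subsequences whose vectors are concatenated. The function names and the `parts` parameter are hypothetical and are not taken from the paper.

```python
from typing import List

NUCLEOTIDES = "ACGT"


def natural_vector(seq: str) -> List[float]:
    """Return [n_A, n_C, n_G, n_T, mu_A..mu_T, D2_A..D2_T] for one sequence.

    Assumed definition (not confirmed by the abstract):
      n_k  = number of occurrences of nucleotide k
      mu_k = mean 1-based position of nucleotide k
      D2_k = sum((pos - mu_k)^2) / (n_k * n), with n the sequence length
    Nucleotides that do not occur contribute zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, c in enumerate(seq) if c == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def subsequence_natural_vector(seq: str, parts: int = 3) -> List[float]:
    """Concatenate the natural vectors of `parts` consecutive subsequences.

    The equal-length split and the default of 3 parts are illustrative
    assumptions only.
    """
    n = len(seq)
    # Integer boundaries that cover the whole sequence without overlap
    bounds = [i * n // parts for i in range(parts + 1)]
    vector: List[float] = []
    for start, end in zip(bounds, bounds[1:]):
        vector.extend(natural_vector(seq[start:end]))
    return vector
```

Under these assumptions, `subsequence_natural_vector(genome, parts=3)` yields a fixed-length vector of 36 numbers regardless of genome length, which is the kind of alignment-free input a discriminant classifier can consume directly.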