A novel alignment-free method for HIV-1 subtype classification.

Affiliations

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.

Department of Biological Sciences, Chicago State University, Chicago, United States of America.

Publication information

Infect Genet Evol. 2020 Jan;77:104080. doi: 10.1016/j.meegid.2019.104080. Epub 2019 Nov 1.

DOI:10.1016/j.meegid.2019.104080

PMID:31683009

Abstract

HIV-1, the most common and most pathogenic type of human immunodeficiency virus, comprises many subtypes. Identifying the subtype of clinical HIV-1 samples is important for studying how subtypes differ in infection, diagnosis and drug design. In this work, we propose an effective numeric representation, the Subsequence Natural Vector (SNV), to encode HIV-1 sequences. Using this representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses. SNV is based on the distribution of nucleotides in HIV-1 viral sequences: it records not only the number of each nucleotide, but also the positions of the nucleotides and the variance of those positions. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms three popular methods, Kameris, Comet and REGA, achieving almost 100% sensitivity and specificity in much less time. Our subtyping algorithm performs particularly well for circulating recombinant forms (CRFs) that are represented by only a few sequences. Our approach also separates unique recombinant forms (URFs) from other subtypes with 100% sensitivity and specificity. Moreover, phylogenetic trees based on the SNV representation were constructed from full-length HIV-1 genomes and from pol genes, and in both trees viruses of the same subtype cluster together correctly.
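The abstract does not give the SNV formula, so the following is only a minimal sketch of the kind of encoding it describes. It assumes the classical natural vector definition (for each nucleotide: its count, the mean of its positions, and a normalized second central moment of those positions). The function `subsequence_natural_vector` and its `parts` parameter are hypothetical: splitting the sequence into equal-length segments and concatenating the per-segment vectors is one plausible reading of "subsequence", not necessarily the authors' construction.

```python
NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return a 12-dimensional vector for a DNA string:
    [counts of A,C,G,T] + [mean positions] + [normalized second central moments].
    Positions are 1-based. Assumed definition, not taken from the abstract."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Nucleotide absent: use zeros so the vector length stays fixed.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def subsequence_natural_vector(seq, parts=2):
    """Hypothetical variant: split seq into `parts` near-equal segments and
    concatenate the natural vector of each segment."""
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    vector = []
    for start, end in zip(bounds, bounds[1:]):
        vector.extend(natural_vector(seq[start:end]))
    return vector
```

Because every sequence maps to a vector of the same length regardless of genome length, such vectors can be passed directly to a classifier such as linear discriminant analysis without first aligning the sequences.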
