Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China.
Qiuzhen College, Tsinghua University, Beijing 100084, China.
Genes (Basel). 2023 Jan 10;14(1):186. doi: 10.3390/genes14010186.
One approach to virus classification and tracing is to generate a minimal model from the gene sequences of each virus group. These models support comparative analysis within and between classes, as well as classification and tracing of new sequences. Defining a minimal model for a group of gene sequences starts from finding their longest common subsequence (LCS), but for an arbitrary number of sequences this problem is non-deterministic polynomial-time hard (NP-hard). We therefore applied heuristic approaches to finding the LCS, together with newer methods of treating gene sequences, including multiple sequence alignment (MSA) and k-mer natural vector (NV) encoding. To evaluate the algorithms, we ran a five-fold cross-validation classification scheme on a dataset of H1N1 virus non-structural protein 1 (NS1) gene sequences. The results indicate that the MSA-based algorithm achieves the best classification accuracy, while the NV-based algorithm has the advantage in the time complexity of generating minimal models.
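As a rough illustration of why heuristics are needed, the sketch below computes an exact longest common subsequence for two sequences by dynamic programming, then folds that pairwise routine over a group of sequences. The fold always yields a subsequence shared by every input, but it is not guaranteed to be the longest one. This is an assumed, generic heuristic and not necessarily one of the algorithms used in the paper; the function names `lcs_pair` and `greedy_common_subsequence` and the toy sequences are hypothetical.

```python
from functools import reduce


def lcs_pair(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (O(len(a) * len(b)) DP)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def greedy_common_subsequence(seqs: list[str]) -> str:
    """Heuristic (assumption, not the paper's method): fold pairwise LCS over seqs.

    The result is a common subsequence of every input, but it may be
    shorter than the true multi-sequence LCS.
    """
    return reduce(lcs_pair, seqs)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real NS1 data.
    group = ["ATGCA", "AGCAT", "TAGCA"]
    print(greedy_common_subsequence(group))
```

The pairwise step is exact and polynomial; the difficulty arises only when the optimum must hold across many sequences at once, which is where the order of folding can change the result.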