Basodi Sunitha, Baykal Pelin Icer, Zelikovsky Alex, Skums Pavel, Pan Yi
Department of Computer Science, Georgia State University, 25 Park Place NE, Atlanta, GA, 30303, USA.
The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 11991, Russia.
BMC Genomics. 2020 Dec 21;21(Suppl 6):405. doi: 10.1186/s12864-020-6661-6.
Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for the analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges: technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures.
We propose a novel preprocessing approach that transforms irregular genomic data into normalized image data. Such a representation allows the problems of classification and comparison of heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method was applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering was performed on data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy.
The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are as accurate as, or more accurate than, other models.
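The abstract does not specify how the normalized images are constructed, so the following is only a hypothetical illustration of the general idea, not the authors' method. It assumes aligned sequences of equal length and maps a population of any size to a fixed-shape array of per-position nucleotide frequencies; the names `population_to_image` and `ALPHABET` are invented for this sketch.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet; gaps and ambiguous bases are ignored


def population_to_image(sequences, counts=None):
    """Map aligned sequences to a (4, alignment_length) frequency array.

    Hypothetical sketch: each column holds the relative frequency of
    A, C, G and T at that alignment position, weighted by read counts.
    """
    if counts is None:
        counts = [1] * len(sequences)
    length = len(sequences[0])
    image = np.zeros((len(ALPHABET), length), dtype=float)
    for seq, count in zip(sequences, counts):
        if len(seq) != length:
            raise ValueError("sequences must be aligned to equal length")
        for pos, base in enumerate(seq.upper()):
            row = ALPHABET.find(base)
            if row >= 0:  # skip characters outside the assumed alphabet
                image[row, pos] += count
    column_totals = image.sum(axis=0, keepdims=True)
    # Normalize each column to sum to 1; columns with no valid bases stay 0.
    np.divide(image, column_totals, out=image, where=column_totals > 0)
    return image


# Two populations of different sizes yield arrays of the same shape.
small = population_to_image(["ACGT", "ACGA"])
large = population_to_image(["ACGT", "ACGA", "TCGT"], counts=[10, 5, 5])
```

Because every population maps to an array of the same shape regardless of how many sequences it contains, the flattened arrays could in principle be passed to a standard classifier or clustering routine.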