Kaden Marika, Bohnsack Katrin Sophie, Weber Mirko, Kudła Mateusz, Gutowska Kaja, Blazewicz Jacek, Villmann Thomas
University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany.
Neural Comput Appl. 2022;34(1):67-78. doi: 10.1007/s00521-021-06018-2. Epub 2021 Apr 27.
We present an alignment-free approach to discriminating SARS-CoV-2 virus types based on their RNA sequence descriptions. To this end, sequences are preprocessed by feature extraction, and the resulting feature vectors are analyzed by prototype-based classification so that the model remains interpretable. In particular, we propose using variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can also be equipped with easy-to-realize reject options for uncertain data. These options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is insufficient. The model is first trained on a GISAID dataset with given virus types, which were detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. The rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity than methods based on (multiple) sequence alignment.
The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
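The pipeline described in the abstract (alignment-free feature extraction, nearest-prototype classification, reject option) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the feature extractor, the dissimilarity measure, or the rejection rule. The sketch assumes normalized k-mer frequencies as features, plain squared Euclidean distance in place of the learned matrix LVQ dissimilarity, and a relative-distance certainty score with a hypothetical `threshold` parameter. Prototypes are passed in directly and are not trained. All function names are illustrative.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=2):
    """Normalized k-mer frequency vector (assumed feature choice, alignment-free).

    Sequences are upper-cased and U is mapped to T so that RNA and cDNA
    notation give the same vector; k-mers containing other symbols
    (for example N) are ignored.
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance, a stand-in for a learned dissimilarity."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_reject(x, prototypes, threshold=0.2):
    """Nearest-prototype decision with a reject option.

    prototypes: list of (vector, label) pairs covering at least two classes.
    Returns (label, certainty); label is None when the sample is rejected.
    certainty = (d_minus - d_plus) / (d_minus + d_plus), where d_plus is the
    distance to the closest prototype and d_minus the distance to the closest
    prototype of any other class. It lies in [0, 1].
    """
    best = {}  # smallest distance found so far for each class label
    for w, label in prototypes:
        d = squared_distance(x, w)
        if label not in best or d < best[label]:
            best[label] = d
    ranked = sorted(best.items(), key=lambda item: item[1])
    (label, d_plus), (_, d_minus) = ranked[0], ranked[1]
    denom = d_plus + d_minus
    certainty = (d_minus - d_plus) / denom if denom > 0 else 0.0
    return (label if certainty >= threshold else None), certainty


# Toy usage with two hand-made prototypes (hypothetical labels "A" and "B").
toy_prototypes = [
    (kmer_features("AAAAAAAA"), "A"),
    (kmer_features("CCCCCCCC"), "B"),
]
typical = classify_with_reject(kmer_features("AAAAAAAC"), toy_prototypes)
atypical = classify_with_reject(kmer_features("AAAACCCC"), toy_prototypes)
```

In this toy setup the first query lies close to one prototype and is assigned its label, while the second is equidistant from both prototypes, has certainty 0, and is rejected. In a trained matrix LVQ model the prototypes and the dissimilarity would both be learned from labeled data, which this sketch leaves out.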