Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences.

Authors

Kaden Marika, Bohnsack Katrin Sophie, Weber Mirko, Kudła Mateusz, Gutowska Kaja, Blazewicz Jacek, Villmann Thomas

Affiliations

University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.

Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany.

Publication information

Neural Comput Appl. 2022;34(1):67-78. doi: 10.1007/s00521-021-06018-2. Epub 2021 Apr 27.

Abstract

We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions, avoiding sequence alignment. For that purpose, sequences are preprocessed by feature extraction, and the resulting feature vectors are analyzed by prototype-based classification so that the model remains interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can also be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity than methods based on (multiple) sequence alignment.
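The pipeline described in the abstract (alignment-free features, then a prototype-based decision with a reject option) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which features are extracted, so normalized k-mer frequencies are an assumption here. The function names `kmer_features` and `classify_with_reject`, the parameter `threshold`, and the relative distance difference used as a certainty score are likewise illustrative choices. Prototypes and the optional mapping matrix `omega` are assumed to be already trained, and at least two classes are assumed.

```python
from itertools import product

import numpy as np


def kmer_features(seq, k=3):
    """Normalized k-mer frequency vector of an RNA sequence (no alignment needed).

    Assumption: k-mer frequencies stand in for the unspecified feature extraction.
    k-mers containing symbols other than A, C, G, U are skipped.
    """
    alphabet = "ACGU"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def classify_with_reject(x, prototypes, labels, omega=None, threshold=0.2):
    """Nearest-prototype decision that returns None when the evidence is too weak.

    Distance: d(x, w) = ||omega (x - w)||^2, the quadratic form used in matrix LVQ;
    omega=None gives the squared Euclidean distance.
    Certainty (illustrative choice): (d_other - d_best) / (d_other + d_best), where
    d_other is the distance to the closest prototype of a different class.
    """
    diff = prototypes - x
    if omega is not None:
        diff = diff @ omega.T
    d = np.sum(diff ** 2, axis=1)
    best = int(np.argmin(d))
    d_best = d[best]
    d_other = d[labels != labels[best]].min()  # requires at least two classes
    denom = d_best + d_other
    certainty = (d_other - d_best) / denom if denom > 0 else 0.0
    if certainty < threshold:
        return None, certainty  # reject: sample is atypical or ambiguous
    return labels[best], certainty


# Toy usage with two hypothetical, already-trained prototypes.
prototypes = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array(["A", "B"])
accepted = classify_with_reject(np.array([0.1, 0.0]), prototypes, labels)
rejected = classify_with_reject(np.array([0.5, 0.0]), prototypes, labels)
```

In this toy example the first point lies close to the prototype of class "A" and is assigned to it, while the second point is equidistant from both prototypes and is rejected. Raising `threshold` makes the classifier reject more samples, which trades coverage for reliability.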

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
