Bohnsack Katrin Sophie, Kaden Marika, Abel Julia, Saralajew Sascha, Villmann Thomas
Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
Bosch Center for Artificial Intelligence, 71272 Renningen, Germany.
Entropy (Basel). 2021 Oct 17;23(10):1357. doi: 10.3390/e23101357.
In this article, we propose using variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider resolved mutual information functions based on Shannon, Rényi, and Tsallis entropy. Combined with interpretable machine learning classifiers based on generalized learning vector quantization, this yields a powerful methodology for sequence classification that, in addition to high classification ability owing to the model-inherent robustness, allows substantial knowledge extraction. Any (slightly) inferior performance of the classifier is compensated by the additional knowledge that interpretable models provide. This knowledge may help the user analyze and understand the data and the task at hand. After a theoretical justification of the concepts, we demonstrate the approach on several example data sets covering different areas of biomolecular sequence analysis.
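To make the idea of a mutual information "fingerprint" concrete, the following is a minimal sketch of the classical Shannon mutual information function of a symbol sequence, evaluated at lags 1 to `max_lag`. It is an illustration under stated assumptions, not the authors' implementation: the function name `mutual_information_function`, the parameter `max_lag`, and the toy sequence are hypothetical, probabilities are estimated by plain relative frequencies, and the resolved, Rényi, and Tsallis variants discussed in the article are not covered.

```python
from collections import Counter
from math import log2


def mutual_information_function(seq, max_lag):
    """Shannon mutual information (in bits) between symbols k positions apart.

    Returns a list with one value per lag k = 1..max_lag. Probabilities are
    plain relative frequencies (an assumption of this sketch).
    """
    profile = []
    for k in range(1, max_lag + 1):
        # All symbol pairs (s_i, s_{i+k}) observed in the sequence.
        pairs = list(zip(seq, seq[k:]))
        n = len(pairs)
        if n == 0:
            # Lag exceeds the sequence length: no pairs, so report 0.
            profile.append(0.0)
            continue
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)    # marginal of the first symbol
        right = Counter(b for _, b in pairs)   # marginal of the second symbol
        mi = 0.0
        for (a, b), c in joint.items():
            # p_ab * log2(p_ab / (p_a * p_b)), written with raw counts.
            mi += (c / n) * log2(c * n / (left[a] * right[b]))
        profile.append(mi)
    return profile


# Hypothetical toy sequence; real inputs would be DNA, RNA, or protein strings.
fingerprint = mutual_information_function("ACGTACGTACGTACGT", max_lag=4)
```

The resulting fixed-length vector could then serve as the input of a vector-based classifier such as a learning vector quantization model; how the article combines the features with the classifier is not specified in this abstract.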