Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective.

Author information

Bohnsack Katrin Sophie, Kaden Marika, Abel Julia, Villmann Thomas

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):119-135. doi: 10.1109/TCBB.2022.3140873. Epub 2023 Feb 3.

DOI:10.1109/TCBB.2022.3140873

Abstract

The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With the realization of the disadvantages of alignments for sequence comparison, some typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss-free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
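The separation described in the abstract can be illustrated with a minimal Python sketch. The concrete choices here are assumptions for illustration and are not taken from the paper: one-hot encoding stands in for a lossless coding, relative k-mer frequencies for a lossy feature generation, and the Euclidean distance for the proximity measure. All function names are hypothetical, and sequences are assumed to contain only the letters A, C, G, and T.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are not handled


def one_hot_encode(seq):
    """Coding step (lossless): one indicator row per sequence position."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def one_hot_decode(code):
    """Inverse of one_hot_encode, showing that no information was lost."""
    return "".join(ALPHABET[row.index(1)] for row in code)


def kmer_profile(seq, k=2):
    """Feature generation step (lossy): relative frequencies of all k-mers."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)  # avoid division by zero for short input
    return [counts[kmer] / total for kmer in kmers]


def euclidean(u, v):
    """Proximity measure applied to the transformed data."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a, b = "ACGA", "GACG"
    # Coding is invertible: the original sequence is recovered exactly.
    print(one_hot_decode(one_hot_encode(a)) == a)
    # Feature generation is not: both sequences contain the 2-mers AC, CG
    # and GA once each, so their profiles coincide although a != b.
    print(euclidean(kmer_profile(a), kmer_profile(b)))
```

In this sketch the two distinct sequences receive identical 2-mer profiles, which is the kind of information loss the abstract attributes to feature generation, while the coding step remains fully reversible. Keeping the dissimilarity function separate from both transformations makes it possible to swap in a different measure without touching the representation.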
