Schleif Frank-Michael
School of Computer Science, University of Birmingham, Birmingham, Edgbaston, B15 2TT, UK.
Methods Mol Biol. 2016;1362:185-95. doi: 10.1007/978-1-4939-3106-4_12.
Sequence data are widely used to gain deeper insight into biological systems. From a data analysis perspective, they are given as a set of symbol sequences of varying length and are generally compared using nonmetric score functions. In this form the data are nonstandard: they do not provide an immediate metric vector space, which complicates their analysis with standard methods. In this chapter we provide various strategies for analyzing this type of data in a mathematically accurate way, instead of the often-seen ad hoc solutions. Our approach is based on scoring values from protein sequence data, although it is applicable in a broader sense. We discuss potential recoding concepts for the scores and present algorithms for solving clustering, classification, and embedding tasks on score data in a protein sequence application.
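To make the idea of recoding nonmetric scores more concrete, the following is a minimal sketch of one common correction, eigenvalue clipping, followed by a simple spectral embedding. It is an illustration under stated assumptions and not necessarily the procedure used in the chapter: the score matrix `S` contains made-up values rather than real alignment scores, and the function names `clip_to_psd` and `embed` are hypothetical.

```python
import numpy as np


def clip_to_psd(S):
    """Symmetrize a score matrix and set its negative eigenvalues to zero.

    Returns the corrected matrix and the eigenvalues of the symmetrized
    input, so the caller can check whether the scores were indefinite.
    """
    S = 0.5 * (S + S.T)                    # enforce symmetry
    w, V = np.linalg.eigh(S)               # eigenvalues in ascending order
    w_clipped = np.clip(w, 0.0, None)      # discard negative eigenvalues
    K = (V * w_clipped) @ V.T              # V diag(w_clipped) V^T
    return K, w


def embed(K, dim=2):
    """Embed items as vectors whose inner products approximate K."""
    w, V = np.linalg.eigh(K)
    top = np.argsort(w)[::-1][:dim]        # indices of the largest eigenvalues
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))


# Hypothetical pairwise similarity scores for three sequences (assumed values).
S = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 4.0],
              [0.0, 4.0, 5.0]])

K, original_eigenvalues = clip_to_psd(S)
X = embed(K, dim=2)                        # one 2-D vector per sequence
```

After the correction, `K` is positive semidefinite and can be passed to kernel-based clustering or classification methods, while the rows of `X` give explicit vector coordinates for the sequences. Clipping is only one option; flipping or shifting the negative eigenvalues are alternatives with different effects on the original scores.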