School of Computer, Electronic and Information, Guangxi University, Nanning 530004, China.
J Theor Biol. 2011 Jan 21;269(1):280-6. doi: 10.1016/j.jtbi.2010.11.002. Epub 2010 Nov 5.
Many clustering approaches have been developed for biological data analysis; however, applying traditional clustering algorithms to RNA structure data remains challenging because of the complex secondary structures involved. One of the most critical issues in cluster analysis is the development of appropriate distance measures in high-dimensional space. Traditional distance measures focus on scale issues but ignore the correlation between two values. This article develops a novel interval-based (Hausdorff) distance measure for computing the similarity between characterized structures. Three relationships are considered: perfectly matched, partially overlapped, and non-overlapped. Finally, we demonstrate the methods by analyzing a data set of RNA secondary structures from the Rfam database.
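The abstract does not state the formula behind the interval-based measure, so the following is only a minimal sketch of the general idea, not the article's method. It assumes a structural element can be summarized as a closed interval `(start, end)` of sequence positions, and uses the standard Hausdorff distance between two closed intervals, max(|a1 - b1|, |a2 - b2|). The names `relationship` and `hausdorff` and the example intervals are hypothetical, and treating containment as "partially overlapped" is an assumption.

```python
from typing import Tuple

# Assumption: a structural element is summarized as a closed interval
# (start, end) of sequence positions, with start <= end.
Interval = Tuple[float, float]


def relationship(a: Interval, b: Interval) -> str:
    """Classify two intervals into the three cases named in the abstract."""
    if a == b:
        return "perfect match"
    # Two closed intervals share at least one point when each one
    # starts no later than the other one ends.
    if a[0] <= b[1] and b[0] <= a[1]:
        return "partially overlapped"  # assumption: includes containment
    return "non-overlapped"


def hausdorff(a: Interval, b: Interval) -> float:
    """Standard Hausdorff distance between two closed intervals.

    For [a1, a2] and [b1, b2] it reduces to max(|a1 - b1|, |a2 - b2|).
    This is the textbook definition, not necessarily the article's measure.
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


if __name__ == "__main__":
    # Hypothetical intervals, chosen only to exercise the three cases.
    pairs = [((1, 5), (1, 5)), ((1, 5), (3, 8)), ((1, 5), (7, 9))]
    for a, b in pairs:
        print(a, b, relationship(a, b), hausdorff(a, b))
```

Under this definition identical intervals are at distance zero, and the distance grows with the larger of the two endpoint shifts, so both position and length differences contribute to the result.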