Hua Keru, Yu Qin, Zhang Ruiming
IEEE/ACM Trans Comput Biol Bioinform. 2016 Sep-Oct;13(5):868-877. doi: 10.1109/TCBB.2015.2495186. Epub 2015 Oct 26.
Sequence similarity is a key mathematical notion for classification and phylogenetic studies in biology. The distance and similarity between two sequences are very important and have been widely studied. Over the last decades, similarity (distance) metric learning has been one of the hottest topics in machine learning and data mining, as well as in their applications to bioinformatics. It is feasible to use machine learning to learn a similarity metric from biological data. In this paper, we propose a novel framework, guaranteed similarity metric learning (GMSL), to compare biological sequences in any feature vector space. It introduces the (ϵ, γ, τ)-goodness similarity theory to Mahalanobis metric learning. As a theoretically guaranteed similarity metric learning approach, GMSL guarantees that the learned similarity function performs well in classification and clustering. Experiments on the most commonly used datasets show that our approach outperforms state-of-the-art biological sequence alignment methods and other similarity metric learning algorithms in both accuracy and stability.
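To make the phrase "Mahalanobis metric in a feature vector space" concrete, the following is a minimal sketch and not the GMSL algorithm itself. It assumes sequences are mapped to normalized k-mer frequency vectors (the abstract does not specify a featurization, so `kmer_features` is a hypothetical helper) and that a symmetric positive semidefinite matrix `M` has already been learned by some method. It only shows how such a matrix would be used to compare two sequences.

```python
import math
from itertools import product


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Hypothetical featurization: normalized k-mer frequency vector."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}
    total = 0
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip k-mers containing symbols outside the alphabet
            counts[km] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def mahalanobis_distance(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)); M is assumed symmetric PSD."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ diff, written out with plain lists.
    m_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    quad = sum(d * md for d, md in zip(diff, m_diff))
    return math.sqrt(max(quad, 0.0))  # clamp tiny negative rounding errors


if __name__ == "__main__":
    x = kmer_features("ACGTACGT", k=1)
    y = kmer_features("AAAACCGT", k=1)
    # Placeholder for a learned matrix: the identity gives Euclidean distance.
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(mahalanobis_distance(x, y, identity))
```

With the identity matrix the distance reduces to the Euclidean distance; a learned `M` would reweight and correlate feature dimensions so that sequences of the same class end up closer together.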