Suppr超能文献

图遍历编辑距离及其扩展

Graph Traversal Edit Distance and Extensions.

作者信息

Ebrahimpour Boroojeny Ali, Shrestha Akash, Sharifi-Zarchi Ali, Gallagher Suzanne Renick, Sahinalp S Cenk, Chitsaz Hamidreza

机构信息

Department of Computer Science, Colorado State University, Fort Collins, Colorado.

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

出版信息

J Comput Biol. 2020 Mar;27(3):317-329. doi: 10.1089/cmb.2019.0511. Epub 2020 Feb 13.

Abstract

Many problems in applied machine learning deal with graphs (also called networks), including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications. In this article, we give a new graph kernel, which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the GTED is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. Also, GTED is motivated by and provides the first mathematical formalism for sequence co-assembly and de novo variation detection in bioinformatics. We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of a support vector machine (SVM) classifier on a few data sets in the literature. Our results suggest that our kernel outperforms many of the common graph kernels in the tested data sets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next-generation sequencing reads.

摘要

应用机器学习中的许多问题都涉及图(也称为网络),包括社交网络、安全、网络数据挖掘、蛋白质功能预测和基因组信息学。核范式巧妙地将学习算法与底层几何空间解耦,这使得图核对于上述应用非常重要。在本文中,我们给出了一种新的图核,我们称之为图遍历编辑距离(GTED)。我们引入了GTED问题并给出了首个针对它的多项式时间算法。非正式地说,GTED是由两个图各自的欧拉遍历的边标签形成的两个字符串之间的最小编辑距离。此外,GTED受到生物信息学中序列共组装和从头变异检测的启发,并为其提供了首个数学形式化方法。我们证明了GTED在图乘积空间中使用线性规划允许一个多项式时间算法,该算法保证能产生整数解。据我们所知,这是解决这个问题的第一种方法。我们还给出了一个用于GTED下界的线性规划松弛算法。我们将GTED用作图核,并通过计算支持向量机(SVM)分类器在文献中的一些数据集上的准确率来对其进行评估。我们的结果表明,在测试数据集中,我们的核优于许多常见的图核。作为第二组实验,我们使用GTED对从新一代测序读数的从头组装获得的病毒基因组组装图成功进行了聚类。

相似文献

1
Graph Traversal Edit Distance and Extensions.图遍历编辑距离及其扩展
J Comput Biol. 2020 Mar;27(3):317-329. doi: 10.1089/cmb.2019.0511. Epub 2020 Feb 13.
6
A binary linear programming formulation of the graph edit distance.图编辑距离的二元线性规划公式化表述。
IEEE Trans Pattern Anal Mach Intell. 2006 Aug;28(8):1200-14. doi: 10.1109/TPAMI.2006.152.
10
Approximate Graph Edit Distance in Quadratic Time.二次时间内的近似图编辑距离。
IEEE/ACM Trans Comput Biol Bioinform. 2020 Mar-Apr;17(2):483-494. doi: 10.1109/TCBB.2015.2478463. Epub 2015 Sep 14.

本文引用的文献

2
Efficient Synergistic Single-Cell Genome Assembly.高效协同的单细胞基因组组装。
Front Bioeng Biotechnol. 2016 May 23;4:42. doi: 10.3389/fbioe.2016.00042. eCollection 2016.
4
SEQuel: improving the accuracy of genome assemblies.SEQuel:提高基因组组装的准确性。
Bioinformatics. 2012 Jun 15;28(12):i188-96. doi: 10.1093/bioinformatics/bts219.
8
9
Protein function prediction via graph kernels.通过图核进行蛋白质功能预测。
Bioinformatics. 2005 Jun;21 Suppl 1:i47-56. doi: 10.1093/bioinformatics/bti1007.

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验