Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China.
City University of Hong Kong Shenzhen Research Institution, Shenzhen, 518057, China.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae149.
Proteoform identification is an important problem in proteomics. The main task is to find the modified protein that best fits an input spectrum. To avoid the combinatorial explosion of possible proteoforms, a proteoform mass graph and a spectrum mass graph are used to represent the protein database and the spectrum, respectively. The problem then becomes finding an optimal alignment between the proteoform mass graph and the spectrum mass graph. Correcting peak errors is an important issue when computing an optimal alignment between the two input mass graphs.
We propose a faster algorithm for the error-correction alignment problem between a spectrum mass graph and a proteoform mass graph, and we provide a program package, TopMGFast. The newly designed algorithms require less space and running time, so that globally optimal alignments of the two input mass graphs can be computed in a reasonable time. For the local alignment version, experiments show that the running time of the new algorithm is reduced by a factor of 2.5. For the global alignment version, experiments show that the maximum mass error between any pair of matched nodes in the alignments obtained by our method stays within a small range, as designed, whereas the alignments produced by the state-of-the-art method, TopMG, have very large maximum mass errors in many cases. The alignment sizes obtained by TopMG and TopMGFast are roughly the same. In the global setting, TopMGFast needs more running time than TopMG. Our new algorithm therefore obtains more reliable global alignments within a reasonable time. This is the first time that globally optimal error-correction alignments have been obtained on real datasets.
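The quality criterion used for the global alignments, the maximum mass error over all matched node pairs, can be illustrated with a minimal sketch. It is not taken from TopMGFast: the function names, the representation of an alignment as index pairs, and the example masses and tolerances are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def max_mass_error(
    proteoform_masses: Sequence[float],
    spectrum_masses: Sequence[float],
    matched_pairs: Sequence[Tuple[int, int]],
) -> float:
    """Largest absolute mass difference over all matched node pairs.

    Hypothetical representation: an alignment is a list of (i, j) pairs,
    where i indexes a proteoform-graph node and j a spectrum-graph node.
    """
    if not matched_pairs:
        return 0.0
    return max(
        abs(proteoform_masses[i] - spectrum_masses[j]) for i, j in matched_pairs
    )


def within_tolerance(
    proteoform_masses: Sequence[float],
    spectrum_masses: Sequence[float],
    matched_pairs: Sequence[Tuple[int, int]],
    tolerance: float,
) -> bool:
    """True if every matched pair differs by at most `tolerance` (assumed in Da)."""
    return max_mass_error(proteoform_masses, spectrum_masses, matched_pairs) <= tolerance


# Illustrative values only; not data from the paper.
proteoform_masses = [0.0, 57.02, 128.06, 285.13]
spectrum_masses = [0.0, 57.03, 128.05, 301.12]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

print(max_mass_error(proteoform_masses, spectrum_masses, alignment))
print(within_tolerance(proteoform_masses, spectrum_masses, alignment, tolerance=0.1))
```

In this toy alignment the last matched pair is off by roughly 16 Da, so the alignment fails a 0.1 Da check even though it has the same size as one whose matches are all tight. This is the kind of difference the comparison between TopMG and TopMGFast measures.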
The source code of the algorithm is available at https://github.com/Zeirdo/TopMGFast.