Qingge Letu, Liu Xiaowen, Zhong Farong, Zhu Binhai
IEEE Trans Nanobioscience. 2017 Mar;16(2):123-130. doi: 10.1109/TNB.2017.2666780. Epub 2017 Feb 9.
In mass spectrometry-based de novo protein sequencing, it is hard to complete the sequence of the whole protein. Motivated by this, we study the (one-sided) problem of filling a protein scaffold S with some missing amino acids, given a sequence of contigs none of which is allowed to be altered, with respect to a complete reference protein P of length n , such that the BLOSUM62 score between P and the filled sequence S' is maximized. We show that this problem is polynomial-time solvable in O(n) time. We also consider the case when the contigs are not of high quality and they are concatenated into an (incomplete) sequence I , where the missing amino acids can be inserted anywhere in I to obtain I' , such that the BLOSUM62 score between P and I' is maximized. We show that this problem is polynomial-time solvable in O(n) time. Due to the high time complexity, both of these algorithms are impractical, we hence present several algorithms based on greedy and local search, trying to solve the problems practically. The empirical results, based on some antibody and mammalian proteins, show that the algorithms can fill protein scaffolds with high quality, provided that a good pair of scaffold and reference are given.
在基于质谱的从头蛋白质测序中,很难完成整个蛋白质的序列测定。受此启发,我们研究了这样一个(单边)问题:给定一系列重叠群(其中任何一个都不允许改变),对于长度为n的完整参考蛋白质P,用一些缺失的氨基酸填充蛋白质支架S,使得P与填充后的序列S'之间的BLOSUM62得分最大化。我们证明这个问题在O(n)时间内是多项式时间可解的。我们还考虑了重叠群质量不高且它们被连接成一个(不完整)序列I的情况,其中缺失的氨基酸可以插入到I中的任何位置以获得I',使得P与I'之间的BLOSUM62得分最大化。我们证明这个问题在O(n)时间内是多项式时间可解的。由于时间复杂度高,这两种算法都不实用,因此我们提出了几种基于贪心和局部搜索的算法,试图实际解决这些问题。基于一些抗体和哺乳动物蛋白质的实验结果表明,只要给出一对良好的支架和参考,这些算法就能高质量地填充蛋白质支架。