Bhandari Bikash Kumar, Goldman Nick
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
NAR Genom Bioinform. 2024 Sep 18;6(3):lqae126. doi: 10.1093/nargab/lqae126. eCollection 2024 Sep.
Protein sequencing is a rapidly evolving field with much progress towards the realization of a new generation of protein sequencers. The early devices, however, may not be able to reliably discriminate all 20 amino acids, resulting in a partial, noisy and possibly error-prone signature of a protein. Rather than achieving sequencing, these devices may aim to identify target proteins by comparing such signatures to databases of known proteins. However, there are no broadly applicable methods for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Based on a hypothetical sequencing device that can simulate several novel technologies, we show that on the human protein database (N = 20 181) our method performs well under many different operating conditions, such as various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments, and insertion and deletion error rates. Our results demonstrate the possibility of protein identification with high accuracy on many early experimental devices. We anticipate that our method will be applicable to a wide range of protein sequencing devices in the future.
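The identification idea described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' model: it assumes a hypothetical device that resolves only two amino acids (C and K) and reads everything else as "X", assumes whole-protein signatures rather than fragments, and scores an observed signature against each database entry with a simple forward recursion that allows substitutions, insertions and deletions. The alphabet, the error rates, the function names and the toy sequences are all illustrative assumptions.

```python
import math

# Assumption: the hypothetical device resolves only C and K; all else reads as "X".
RESOLVED = {"C", "K"}
SYMBOLS = ("C", "K", "X")


def to_signature(protein):
    """Collapse a full amino-acid sequence to the reduced alphabet."""
    return "".join(aa if aa in RESOLVED else "X" for aa in protein)


def logaddexp(a, b):
    """Return log(exp(a) + exp(b)) without underflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def forward_log_score(observed, protein, p_sub=0.05, p_ins=0.02, p_del=0.02):
    """Unnormalised forward log-score of `observed` given `protein`.

    The error rates are illustrative defaults, not values from the paper.
    """
    truth = to_signature(protein)
    n, m, k = len(truth), len(observed), len(SYMBOLS)
    log_match = math.log(1.0 - p_ins - p_del)
    log_ins = math.log(p_ins / k)          # inserted symbol drawn uniformly
    log_del = math.log(p_del)
    log_same = math.log(1.0 - p_sub)       # symbol read correctly
    log_diff = math.log(p_sub / (k - 1))   # symbol misread as another class

    # f[i][j]: log-score of emitting observed[:j] after consuming truth[:i],
    # summed over all alignments (paths) that reach that cell.
    f = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = -math.inf
            if i > 0 and j > 0:  # match or substitution
                emit = log_same if truth[i - 1] == observed[j - 1] else log_diff
                total = logaddexp(total, f[i - 1][j - 1] + log_match + emit)
            if i > 0:            # deletion: residue produced no signal
                total = logaddexp(total, f[i - 1][j] + log_del)
            if j > 0:            # insertion: spurious extra signal
                total = logaddexp(total, f[i][j - 1] + log_ins)
            f[i][j] = total
    return f[n][m]


def identify(observed, database):
    """Return the name of the database entry with the highest forward score."""
    return max(database, key=lambda name: forward_log_score(observed, database[name]))


if __name__ == "__main__":
    # Toy database with made-up sequences, for illustration only.
    toy_db = {"P1": "MKCAAKLLC", "P2": "AAAACCCKAA", "P3": "KKKKAAAA"}
    noisy = "XKCXKXXC"  # signature of P1 with one "X" deleted
    print(identify(noisy, toy_db))
```

Because the recursion sums over all alignments instead of keeping only the best one, the score behaves like a likelihood up to normalisation, which is what makes ranking database entries by it reasonable. The table costs time and memory proportional to the product of the two sequence lengths per database entry, so a realistic implementation would need a more economical scheme.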