Tran Ngoc Hieu, Zhang Xianglilan, Xin Lei, Shan Baozhen, Li Ming
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing 100071, China.
Proc Natl Acad Sci U S A. 2017 Aug 1;114(31):8247-8252. doi: 10.1073/pnas.1705691114. Epub 2017 Jul 18.
De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences such as monoclonal antibodies (mAbs). In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. The DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state-of-the-art methods, achieving 7.7-22.9% higher accuracy at the amino acid level and 38.1-64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of mouse antibody light and heavy chains, achieving 97.5-100% coverage and 97.2-99.5% accuracy without assisting databases. Moreover, DeepNovo can be retrained to adapt to any data source and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Our study not only extends the deep learning revolution to a new field but also demonstrates an innovative approach to solving optimization problems by combining deep learning with dynamic programming.
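The abstract does not spell out how dynamic programming is combined with the networks. One plausible reading, sketched below purely as an illustration, is a knapsack-style table over residue masses that restricts each decoding step to amino acids that can still add up to the remaining peptide mass. Everything here is an assumption rather than the authors' implementation: the function names, the reduced seven-residue alphabet, the use of integer nominal masses, the greedy search, and `score_fn`, which merely stands in for whatever score a trained model would assign.

```python
# Nominal (integer) residue masses in daltons for a small illustrative alphabet.
# Assumption: a real system would use the full alphabet and finer mass resolution.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "L": 113}


def reachable_masses(max_mass, masses=RESIDUE_MASS):
    """Knapsack table: ok[m] is True if m is a sum of residue masses."""
    ok = [False] * (max_mass + 1)
    ok[0] = True  # the empty suffix has mass 0
    for m in range(1, max_mass + 1):
        ok[m] = any(m >= w and ok[m - w] for w in masses.values())
    return ok


def feasible_next_residues(remaining_mass, ok, masses=RESIDUE_MASS):
    """Residues whose mass fits and leaves a remainder that is still reachable."""
    return sorted(
        aa for aa, w in masses.items()
        if w <= remaining_mass and ok[remaining_mass - w]
    )


def greedy_decode(total_mass, score_fn):
    """Pick the best-scoring feasible residue at each step (hypothetical decoder).

    score_fn(prefix, aa) is a placeholder for a learned model's score.
    Returns None if no residue sequence matches total_mass.
    """
    ok = reachable_masses(total_mass)
    peptide, remaining = [], total_mass
    while remaining > 0:
        candidates = feasible_next_residues(remaining, ok)
        if not candidates:
            return None
        best = max(candidates, key=lambda aa: score_fn(peptide, aa))
        peptide.append(best)
        remaining -= RESIDUE_MASS[best]
    return "".join(peptide)
```

In this sketch the mass filter guarantees that any sequence the decoder returns sums exactly to the target mass, so the scoring function only has to rank candidates that are already consistent with the precursor mass.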