Savidor Alon, Barzilay Rotem, Elinger Dalia, Yarden Yosef, Lindzen Moshit, Gabashvili Alexandra, Adiv Tal Ophir, Levin Yishai
From ‡The Nancy and Stephen Grand Israel National Center for Personalized Medicine, Weizmann Institute of Science, Rehovot, Israel, and
the §Department of Biological Regulation, Weizmann Institute of Science, Rehovot, Israel 76100.
Mol Cell Proteomics. 2017 Jun;16(6):1151-1161. doi: 10.1074/mcp.O116.065417. Epub 2017 Mar 27.
Traditional "bottom-up" proteomic approaches use proteolytic digestion, LC-MS/MS, and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here, we present Database-independent Protein Sequencing, a method for unambiguous, rapid, database-independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler." As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant monoclonal antibody. Excluding leucine/isoleucine and glutamic acid/deamidated glutamine ambiguities, end-to-end full-length sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100%, but there was a 23-residue gap in the constant region sequence.
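The assembly step described above, in which overlapping peptide tags are merged into a consensus sequence, can be illustrated with a minimal sketch. This is not the authors' "Peptide Tag Assembler"; it is a generic greedy suffix-prefix overlap merge, shown only to convey the idea. The function names (`normalize`, `overlap`, `assemble`), the `min_overlap` threshold, and the example tags are all hypothetical. Because leucine and isoleucine have identical mass, the sketch collapses I into L before comparing tags, mirroring the ambiguity the abstract excludes from its accuracy figures.

```python
from typing import List


def normalize(tag: str) -> str:
    """Collapse isoleucine (I) into leucine (L); the two residues are isobaric."""
    return tag.upper().replace("I", "L")


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` residues exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(tags: List[str], min_overlap: int = 3) -> List[str]:
    """Greedily merge tags by their longest suffix-prefix overlap.

    Returns the remaining contigs; more than one contig means the tags
    left a gap that could not be bridged. (Assumed toy strategy, not the
    published algorithm.)
    """
    unique = sorted({normalize(t) for t in tags})
    # Drop tags fully contained in another tag.
    contigs = [t for t in unique if not any(t != u and t in u for u in unique)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        # Discard contigs that the merged sequence now contains.
        contigs = [c for c in rest if c not in merged] + [merged]


if __name__ == "__main__":
    # Hypothetical tags from overlapping semi-random cleavage products.
    print(assemble(["MKTAY", "TAYIAK", "IAKQR", "AKQ"]))
```

A real assembler would also have to weigh tag confidence and resolve conflicting overlaps, which this sketch ignores; when tags do not overlap by at least `min_overlap` residues, it simply returns several contigs, analogous to the gap reported for the heavy-chain constant region.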