Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA.
Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA.
Genome Biol. 2022 Mar 3;23(1):69. doi: 10.1186/s13059-022-02624-y.
The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio or Oxford Nanopore) provides full-length transcripts, which can be used to predict full-length protein isoforms.
We describe here a long-read proteogenomics approach for integrating sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data to enable detection of protein isoforms previously intractable to MS-based detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.
Our work suggests that the incorporation of long-read sequencing and proteomic data can facilitate improved characterization of human protein isoform diversity. Our first-generation pipeline provides a strong foundation for future development of long-read proteogenomics and its adoption for both basic and translational research.