Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, United States.
Signature Sciences and Technology Division, Pacific Northwest National Laboratory, Richland, Washington 99354, United States.
J Proteome Res. 2022 Aug 5;21(8):2023-2035. doi: 10.1021/acs.jproteome.2c00334. Epub 2022 Jul 6.
Metaproteomics is increasingly used for high-throughput characterization of proteins in complex environments and has been shown to provide insights into microbial composition and functional roles. However, significant challenges remain in metaproteomic data analysis, including the creation of a sample-specific protein sequence database. A well-matched database is required for successful metaproteomics analysis: the accuracy and sensitivity of peptide-spectrum match (PSM) identification algorithms suffer when the database is incomplete or contains extraneous sequences. When matched DNA sequencing data for the sample are unavailable or incomplete, creating a proteome database that accurately represents the organisms in the sample is a challenge. Here, we leverage a peptide sequencing approach to identify the sample composition directly from metaproteomic data. First, we created a deep learning model, Kaiko, to predict peptide sequences from mass spectrometry data and trained it on 5 million PSMs from 55 phylogenetically diverse bacteria. After training, Kaiko successfully identified organisms from soil isolates and synthetic communities directly from proteomics data. Finally, we created a pipeline for metaproteome database generation using Kaiko. We tested the pipeline on native soils collected in Kansas, showing that the sequencing model can be employed as an alternative and complementary method to construct the sample-specific protein database instead of relying on (un)matched metagenomes. Our pipeline identified all highly abundant taxa from 16S rRNA sequencing of the soil samples and uncovered several additional species that were strongly represented only in the proteomic data.
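The database-generation idea in the abstract (predict peptides, infer which organisms are present, then assemble a sample-specific database from their proteomes) can be illustrated with a minimal sketch. This is not the Kaiko pipeline itself: the function names `rank_taxa` and `build_database`, the exact-substring matching rule, the toy sequences, and the `top_n` cutoff are all assumptions made for illustration. The only domain detail relied on is that isoleucine and leucine have identical mass, so peptides predicted from spectra cannot distinguish them; the sketch therefore treats I and L as equivalent.

```python
from collections import Counter


def normalize(seq):
    """Treat isoleucine (I) and leucine (L) as the same residue (identical mass)."""
    return seq.replace("I", "L")


def rank_taxa(peptides, reference_proteomes):
    """Count, per taxon, how many distinct predicted peptides occur in its proteins.

    Hypothetical scoring rule: exact substring match after I/L normalization.
    """
    unique_peptides = {normalize(p) for p in peptides}
    hits = Counter()
    for taxon, proteins in reference_proteomes.items():
        normalized_proteins = [normalize(p) for p in proteins]
        for pep in unique_peptides:
            if any(pep in protein for protein in normalized_proteins):
                hits[taxon] += 1
    # Taxa with no matching peptides never enter the counter.
    return hits.most_common()


def build_database(ranked_taxa, reference_proteomes, top_n):
    """Keep the proteomes of the top_n best-supported taxa (assumed cutoff)."""
    selected = [taxon for taxon, _ in ranked_taxa[:top_n]]
    return {taxon: reference_proteomes[taxon] for taxon in selected}


# Toy data, invented for illustration only.
reference_proteomes = {
    "taxon_A": ["MKTAYIAKQRQISFVK", "GSHMLEDPVR"],
    "taxon_B": ["MSDNELLKQAR", "TTPGWVILK"],
}
predicted_peptides = ["AYLAKQR", "LEDPVR", "ELLKQAR", "WWWWW"]

ranked = rank_taxa(predicted_peptides, reference_proteomes)
database = build_database(ranked, reference_proteomes, top_n=1)
```

Here `ranked` orders taxa by the number of supporting peptides, and `database` holds only the proteomes of the best-supported taxa. A real workflow would need a far larger reference collection, tolerance for sequencing errors in the predicted peptides, and a principled cutoff rather than a fixed `top_n`.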