Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China.
School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia.
Microbiome. 2024 Mar 19;12(1):58. doi: 10.1186/s40168-024-01775-3.
Microbiota are closely associated with human health and disease. Metaproteomics provides a direct means of identifying microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics remains limited by the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although several metagenomics-based database construction strategies have been developed, optimization of gene taxonomic annotation, which is extremely important for accurate metaproteomic analysis, has not been reported.
Here we propose contigs directed gene annotation (ConDiGA), a pipeline for accurate taxonomic annotation of genes from metagenomic data, and use it to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA, also referred to as MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes are annotated directly against the whole bacterial genome database. In MD2, contigs are annotated against the whole bacterial genome database, and the taxonomic assignment of each contig is transferred to its genes. In MD3, the most confident species from the contig annotation results are taken as the reference against which genes are annotated. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. On a synthetic microbial community of 12 species, Kaiju with the MD3 pipeline outperformed the other combinations in constructing a protein sequence database from metagenomic data. Similar performance was observed with a fecal sample, as well as with in silico mixed datasets of the simulated microbial community and the fecal sample.
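The difference between the three strategies can be illustrated with a toy sketch. This is not the ConDiGA code: the function names, the score dictionaries, and in particular the `min_contigs` rule used to decide which species count as "most confident" are assumptions made for illustration only, and the hit scores are invented placeholders rather than output of BLAST, Kaiju, or Kraken2.

```python
from collections import Counter

# Hypothetical per-contig and per-gene hit scores against a full genome database.
contig_hits = {
    "c1": {"E. coli": 0.9, "S. enterica": 0.4},
    "c2": {"E. coli": 0.8},
    "c3": {"B. subtilis": 0.95},
    "c4": {"B. subtilis": 0.7, "S. enterica": 0.6},
    "c5": {"S. enterica": 0.5},
}
gene_hits = {
    "g1": {"E. coli": 0.7, "S. enterica": 0.75},
    "g2": {"E. coli": 0.9},
    "g3": {"B. subtilis": 0.8},
    "g4": {"S. enterica": 0.6},
}
gene_to_contig = {"g1": "c1", "g2": "c2", "g3": "c3", "g4": "c5"}


def best_hit(hits):
    """Return the species with the highest score in a hit dictionary."""
    return max(hits, key=hits.get)


def md1(gene_hits):
    """MD1: each gene takes its own best hit against the full database."""
    return {g: best_hit(h) for g, h in gene_hits.items() if h}


def md2(contig_hits, gene_to_contig):
    """MD2: each gene inherits the best-hit species of its parent contig."""
    contig_label = {c: best_hit(h) for c, h in contig_hits.items() if h}
    return {g: contig_label[c] for g, c in gene_to_contig.items() if c in contig_label}


def md3(contig_hits, gene_hits, min_contigs=2):
    """MD3-like: restrict the reference to well-supported species, then annotate genes.

    Assumption: a species is 'confident' if it is the best hit of at least
    `min_contigs` contigs. The actual criterion is not given in the abstract.
    """
    support = Counter(best_hit(h) for h in contig_hits.values() if h)
    reference = {s for s, n in support.items() if n >= min_contigs}
    annotated = {}
    for gene, hits in gene_hits.items():
        restricted = {s: v for s, v in hits.items() if s in reference}
        if restricted:  # genes with no hit in the reduced reference stay unannotated
            annotated[gene] = best_hit(restricted)
    return annotated


if __name__ == "__main__":
    print("MD1:", md1(gene_hits))
    print("MD2:", md2(contig_hits, gene_to_contig))
    print("MD3:", md3(contig_hits, gene_hits))
```

In this toy input, gene g1 is labeled S. enterica by the direct gene-level search but E. coli once the reference is restricted to species supported by several contigs, and g4 is left unannotated because its only hit falls outside the reduced reference.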
Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. Our study addresses the current problem of unreliable taxonomic annotation in metagenomics-derived protein sequence databases and can promote in-depth metaproteomic analysis of the microbiome. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The ConDiGA code is openly available on GitHub for the analysis of microbiota samples.