State Key Laboratory of Genetic Engineering, Institute of Plant Biology, School of Life Sciences, Fudan University, Shanghai 200433, China.
The T-Life Research Center, Fudan University, Shanghai 200433, China.
Nucleic Acids Res. 2019 Mar 18;47(5):e30. doi: 10.1093/nar/gkz017.
Metagenomic studies, greatly advanced by the rapid development of next-generation sequencing (NGS) technologies, uncover the complex structures of microbial communities and their interactions with the environment. Because most microbes lack genome sequence information, it is essential to assemble prokaryotic genomes ab initio in order to retrieve complete coding genes from various metabolic pathways. The complex nature of microbial composition and the burden of handling vast amounts of metagenomic data pose great challenges to the development of effective and efficient bioinformatic tools. Here we present a protein assembler, MetaPA, which is based on de Bruijn graph searching in oligopeptide space and can be applied to both metagenomic and metatranscriptomic sequencing data. When public homologous protein sequences are used to guide the assembly procedure, MetaPA assembles 85% of total proteins as complete sequences with a high precision of 83% on real high-throughput sequencing datasets. Applied to metatranscriptomic data, MetaPA successfully identifies the majority of actively transcribed genes validated in related studies. These results suggest that MetaPA has good potential in both metagenomic and metatranscriptomic studies for characterizing the composition and abundance of microbiota.
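To illustrate the general idea of de Bruijn graph assembly over oligopeptides, the following is a minimal sketch and not MetaPA's actual implementation. It assumes the reads have already been translated into peptide fragments, uses a hypothetical k-mer size of 4 residues, and extends a seed only along unambiguous paths. The function names `build_graph` and `extend` and the toy fragments are illustrative assumptions.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Link each (k-1)-residue prefix to the (k-1)-residue suffixes that follow it."""
    graph = defaultdict(set)
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend(graph, seed):
    """Extend a (k-1)-residue seed to the right while exactly one successor exists."""
    contig = seed
    node = seed
    seen = {node}  # guards against looping forever on cyclic paths
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]  # each step adds one amino acid
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping peptide fragments (hypothetical data, not from the paper).
fragments = ["MKTAYIAK", "AYIAKQRQ", "AKQRQISF"]
graph = build_graph(fragments, k=4)
contig = extend(graph, "MKT")
print(contig)
```

In this toy case the three overlapping fragments merge into one longer peptide. A practical assembler would additionally need to handle branching nodes, sequencing errors, and six-frame translation of nucleotide reads, which this sketch leaves out.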