Institute of Marine Science and Technology, Shandong University, Qingdao, China.
Department of Ecology, College of Life Sciences, Zhejiang University, Hangzhou, Zhejiang, China.
Bioinformatics. 2019 Mar 15;35(6):1040-1048. doi: 10.1093/bioinformatics/bty741.
The nitrogen (N) cycle is a collection of important biogeochemical pathways in the Earth's ecosystem and has received extensive attention in ecological and environmental studies. Shotgun metagenome sequencing is now widely applied to explore the gene families responsible for N cycle processes. However, applying publicly available orthology databases to profile N cycle gene families in shotgun metagenomes is problematic: database searching is inefficient, orthology groups are nonspecific, and coverage of N cycle genes and/or gene (sub)families is low.
To solve these issues, this study built a manually curated integrative database (NCycDB) for fast and accurate profiling of N cycle gene (sub)families from shotgun metagenome sequencing data. NCycDB contains a total of 68 gene (sub)families and covers eight N cycle processes, with 84 759 and 219 146 representative sequences at 95% and 100% identity cutoffs, respectively. We also identified 1958 homologous orthology groups and included the corresponding sequences in the database to avoid false positive assignments due to 'small database' issues. We applied NCycDB to characterize N cycle gene (sub)families in 52 shotgun metagenomes from the Global Ocean Sampling expedition. Further analysis showed that the structure and composition of N cycle gene families were most strongly correlated with latitude and temperature. NCycDB is expected to facilitate N cycle studies via shotgun metagenome sequencing approaches in various environments. The framework developed in this study can serve as a reference for building similar knowledge-based functional gene databases for other processes and pathways.
NCycDB database files are available at https://github.com/qichao1984/NCyc.
Supplementary data are available at Bioinformatics online.
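The role of the homologous orthology groups in avoiding 'small database' false positives can be illustrated with a minimal Python sketch. It is not the NCycDB tooling itself: the function name `profile_gene_families`, the hit tuple layout, the sequence identifiers, and the mapping dictionary are all hypothetical, and the gene names are used only as illustrative labels. The idea is that each read is assigned to its single best-scoring hit, and a read whose best hit is a homolog outside the target gene families is not counted.

```python
from collections import Counter


def profile_gene_families(hits, seq_to_family, min_identity=0.0):
    """Count reads per gene family from similarity-search hits.

    hits: iterable of (read_id, subject_id, percent_identity, bit_score).
    seq_to_family: maps target sequence IDs to a gene family name.
        Homolog sequences are deliberately absent from this mapping
        (an assumption of this sketch), so reads whose best hit is a
        homolog are dropped rather than forced onto a target family.
    """
    best = {}  # read_id -> (subject_id, bit_score) of the best hit so far
    for read_id, subject_id, identity, bit_score in hits:
        if identity < min_identity:
            continue
        if read_id not in best or bit_score > best[read_id][1]:
            best[read_id] = (subject_id, bit_score)

    counts = Counter()
    for subject_id, _ in best.values():
        family = seq_to_family.get(subject_id)
        if family is not None:
            counts[family] += 1
    return counts


# Hypothetical data: "h1" stands for a homolog sequence that is in the
# search database but does not belong to any target gene family.
hits = [
    ("r1", "s1", 98.0, 200.0),
    ("r1", "h1", 99.0, 250.0),  # better hit to a homolog: r1 is not counted
    ("r2", "s2", 96.0, 180.0),
    ("r3", "s1", 97.0, 150.0),
]
seq_to_family = {"s1": "amoA", "s2": "nirK"}

counts = profile_gene_families(hits, seq_to_family)
print(sorted(counts.items()))  # [('amoA', 1), ('nirK', 1)]
```

Without the homolog sequence in the search database, read `r1` would have matched `s1` and been counted toward a target family, which is the kind of false positive assignment the abstract describes.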