Center for Computational Molecular Biology, Brown University, Providence, RI, USA.
Center for Biomedical Informatics, Brown University, Providence, RI, USA.
Bioinformatics. 2022 Sep 2;38(17):4172-4177. doi: 10.1093/bioinformatics/btac487.
Microbiome datasets are often constrained by sequencing limitations. GenBank, maintained by the National Center for Biotechnology Information (NCBI), is the largest collection of publicly available DNA sequences. The metadata of GenBank records are a largely understudied resource and may be uniquely leveraged to access the sum of prior studies focused on microbiome composition. Here, we developed a computational pipeline to analyze GenBank metadata, which contain data on hosts, microorganisms and their place of origin. This work provides the first opportunity to leverage the totality of GenBank to shed light on the compositional data practices that shape how microbiome datasets are formed, and to examine host-microbiome relationships.
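As a rough illustration of the kind of metadata involved, the following minimal sketch pulls the `/organism`, `/host` and `/country` qualifiers out of GenBank flat-file text and counts host-microorganism pairs. It is not the authors' pipeline: the function names `extract_metadata` and `host_microbe_counts` and the `SAMPLE` records are hypothetical, and real records would need handling for missing, free-text or inconsistently spelled host values.

```python
import re
from collections import Counter

# Qualifiers of the GenBank "source" feature that carry the metadata of interest.
QUALIFIER_RE = re.compile(r'^\s+/(organism|host|country)="([^"]*)"', re.MULTILINE)

# Hypothetical, heavily truncated records, for illustration only.
SAMPLE = '''LOCUS       EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Escherichia coli"
                     /host="Homo sapiens"
                     /country="USA"
//
LOCUS       EXAMPLE2
FEATURES             Location/Qualifiers
     source          1..80
                     /organism="Escherichia coli"
                     /host="Homo sapiens"
//
LOCUS       EXAMPLE3
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Influenza A virus"
                     /host="Gallus gallus"
                     /country="China"
//
'''


def extract_metadata(record_text):
    """Return the organism, host and country qualifiers found in one record."""
    # Collapse whitespace so values wrapped across lines become single strings.
    return {key: " ".join(value.split())
            for key, value in QUALIFIER_RE.findall(record_text)}


def host_microbe_counts(flat_file_text):
    """Count (host, organism) pairs across records terminated by '//' lines."""
    counts = Counter()
    for record in re.split(r'^//\s*$', flat_file_text, flags=re.MULTILINE):
        meta = extract_metadata(record)
        # Records without a host annotation cannot contribute a pair.
        if "host" in meta and "organism" in meta:
            counts[(meta["host"], meta["organism"])] += 1
    return counts


if __name__ == "__main__":
    for (host, organism), n in host_microbe_counts(SAMPLE).most_common():
        print(f"{host}\t{organism}\t{n}")
```

A table of such pair counts is one plausible starting point for the clustering described in the abstract, although the published pipeline may structure its data differently.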
The collected dataset contains multiple kingdoms of microorganisms, consisting of bacteria, viruses, archaea, protozoa, fungi and invertebrate parasites, and hosts from multiple taxonomic classes, including mammals, birds and fish. A human subset of this dataset provides insights into gaps in current microbiome data collection, which is biased towards clinically relevant pathogens. Clustering and phylogenetic analyses show the potential of these data for modeling host taxonomy and evolution, revealing groupings formed by host diet, environment and coevolution.
GenBank Host-Microbiome Pipeline is available at https://github.com/bcbi/genbank_holobiome. The GenBank loader is available at https://github.com/bcbi/genbank_loader.
Supplementary data are available at Bioinformatics online.