Area Science Park, Padriciano, 99, 34149, Trieste, Italy.
University of Trieste, Trieste, 34127, Italy.
Sci Data. 2024 Jun 1;11(1):568. doi: 10.1038/s41597-024-03131-4.
Technological advances in massively parallel sequencing have led to exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects that produce new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets, with applications ranging from medicine to biology. However, its low level of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. To this end, we release the DPCfam-UHGP50 dataset, which contains 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline that groups sequences into single- or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at both the protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in human gut metagenomics, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and a Zenodo repository.
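The abstract does not define how family coverage at the protein and residue levels is measured. The sketch below is one plausible reading, not the authors' method: protein-level coverage is taken as the fraction of sequences with at least one family match, and residue-level coverage as the fraction of all residues falling inside at least one match. The function names, the input layout, and the 1-based inclusive coordinates are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a family match is a 1-based, inclusive
# (start, end) region on a protein sequence.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent regions so no residue is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous region instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def family_coverage(
    lengths: Dict[str, int],
    matches: Dict[str, List[Interval]],
) -> Tuple[float, float]:
    """Return (protein-level coverage, residue-level coverage).

    lengths: sequence length for every protein in the dataset.
    matches: family match regions per protein; unmatched proteins may be absent.
    """
    covered_proteins = 0
    covered_residues = 0
    for protein_id, length in lengths.items():
        regions = merge_intervals(matches.get(protein_id, []))
        n_covered = sum(end - start + 1 for start, end in regions)
        if n_covered > 0:
            covered_proteins += 1
        covered_residues += n_covered
    total_residues = sum(lengths.values())
    return covered_proteins / len(lengths), covered_residues / total_residues


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_lengths = {"p1": 100, "p2": 50, "p3": 50}
    toy_matches = {"p1": [(1, 40), (30, 60)], "p2": [(11, 20)]}
    protein_cov, residue_cov = family_coverage(toy_lengths, toy_matches)
    print(f"protein-level: {protein_cov:.3f}, residue-level: {residue_cov:.3f}")
```

Merging overlapping regions before counting matters when a sequence is matched by several families: without it, residue-level coverage could exceed 1.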