Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
J Mol Biol. 2022 Aug 15;434(15):167597. doi: 10.1016/j.jmb.2022.167597. Epub 2022 May 6.
Biosynthetic gene clusters (BGCs) in bacterial genomes code for important small molecules and secondary metabolites. Based on validated BGCs and the corresponding sequences of protein family domains (Pfams), Pfam functions and clan information, we develop a deep learning method, e-DeepBGC, that extends DeepBGC to detect BGCs and their biosynthetic classes in bacterial genomes. We show that e-DeepBGC achieves a lower false positive rate and a higher sensitivity in BGC identification than DeepBGC. We apply e-DeepBGC to 5,666 RefSeq bacterial genomes and detect a total of 170,685 BGCs, an average of 30.1 BGCs per genome. We summarize all the predicted BGCs, their functional classes and their distribution across bacterial phyla.
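As a quick consistency check on the reported figures, the per-genome average can be recomputed from the two totals given in the abstract. The snippet below is only an illustrative sketch; the variable names are ours and are not part of e-DeepBGC or DeepBGC.

```python
# Totals quoted in the abstract (variable names are illustrative only).
n_genomes = 5666      # RefSeq bacterial genomes analysed
n_bgcs = 170685       # BGCs detected across those genomes

# Mean number of detected BGCs per genome.
mean_bgcs_per_genome = n_bgcs / n_genomes

print(f"{mean_bgcs_per_genome:.1f} BGCs per genome")
```

The result rounds to the 30.1 BGCs per genome stated in the abstract.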