Goyat Hemant, Singh Dalwinder, Paliyal Sunaina, Mantri Shrikant
Computational Biology Lab, National Agri-Food Biotechnology Institute, Sector 81, SAS Nagar, Punjab 140308, India.
Department of Anatomy and Cell Biology, Western University, London, Ontario N6A 3K7, Canada.
J Chem Inf Model. 2025 Jul 14;65(13):7193-7208. doi: 10.1021/acs.jcim.5c00465. Epub 2025 Jun 16.
Microorganisms such as bacteria and fungi have been used as sources of natural products that translate into drugs. However, assessing the bioactivity of culture extracts to identify novel natural molecules remains strenuous because of the cumbersome sequence of production, purification, and assaying. Extensive genome mining of microbiomes is therefore underway to identify biosynthetic gene clusters (BGCs) that can be profiled as particular natural products, and machine learning methods have been developed to predict bioactivity from these clusters. Existing tools, however, are ineffective because of small training data sets, dependence on outdated genome mining tools, a lack of relevant genomic descriptors, and prevalent class imbalance. This work presents a new tool, NPBdetect, which can detect multiple bioactivities and was designed through rigorous experiments. First, we assembled a larger training set from the MIBiG database to build the model and a test set through literature mining to assess it. Second, we used the latest antiSMASH genome mining tool to obtain BGCs and introduced new sequence-based descriptors. Third, we built the model with neural networks, addressing class imbalance through class weighting. Finally, we compared NPBdetect with an existing tool to show its efficacy and real-world utility in detecting several bioactivities with high confidence.
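The abstract names class weighting as the remedy for class imbalance but does not say how the weights are computed. As an illustration only, the sketch below assumes a multi-label setup (one 0/1 flag per bioactivity) and a common heuristic in which each label's positive class is weighted by the ratio of negatives to positives. The function names, the toy labels, and the weighting formula are assumptions made here and are not taken from NPBdetect.

```python
import math


def positive_class_weights(labels):
    """Per-label weight for the positive class: n_negative / n_positive.

    `labels` is a list of rows; each row holds one 0/1 flag per activity.
    This ratio is an assumed heuristic, not the paper's stated formula.
    """
    n_samples = len(labels)
    n_labels = len(labels[0])
    weights = []
    for j in range(n_labels):
        n_pos = sum(row[j] for row in labels)
        # A label with no positives falls back to a neutral weight of 1.0.
        weights.append((n_samples - n_pos) / n_pos if n_pos else 1.0)
    return weights


def weighted_bce(y_true, y_prob, pos_weights, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled per label."""
    total, count = 0.0, 0
    for row_true, row_prob in zip(y_true, y_prob):
        for t, p, w in zip(row_true, row_prob, pos_weights):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy data: six clusters, three hypothetical activity labels.
# The first label is common, the other two are rare.
labels = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]
weights = positive_class_weights(labels)
```

With these weights, a missed positive on a rare label contributes more to the loss than a missed positive on a common one, which pushes a network trained on this loss to attend to under-represented activities.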