Nafi Md Muhaiminul Islam, Mohaimin Abdullah Al
Department of CSE, BUET, Dhaka 1000, Bangladesh.
Department of CSE, United International University (UIU), Dhaka 1212, Bangladesh.
Bioinform Adv. 2025 Aug 23;5(1):vbaf204. doi: 10.1093/bioadv/vbaf204. eCollection 2025.
Heavy use of synthetic nitrogen fertilizers to meet the growing demand for food has caused severe environmental impacts, such as declining crop yields and eutrophication. One promising alternative is to use nitrogen-fixing microorganisms as biofertilizers; these organisms rely on the nitrogenase enzyme. The same goal could also be achieved by expressing a functional nitrogenase enzyme in the cells of cereal crops.
In this study, we used machine learning techniques to predict microbial strains with a high potential for nitrogenase activity. The objective was to enable the screening and ranking of candidate strains based on genomic information. We explored several protein language model embeddings for this prediction task and built two stacking ensemble models. The first, NFEmbed-C, uses k-Nearest Neighbors as the base learner and Random Forest as the meta learner. The second, NFEmbed-R, combines a Decision Tree Regressor and an eXtreme Gradient Boosting Regressor as base learners, with a Support Vector Regressor as the meta learner. On the test set, both NFEmbed-C and NFEmbed-R outperformed the state-of-the-art methods, with improvements ranging from 0% to 11.2% and from 30% to 51%, respectively. On the same test set, NFEmbed-R achieved a score of 0.783, an MSE of 0.158, and an RMSE of 0.398, while NFEmbed-C achieved a sensitivity of 0.949, an F1 score of 0.892, and a Matthews correlation coefficient of 0.784.
We performed our analysis in Python; code is available at https://github.com/nafcoder/NFEmbed.
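The two stacking layouts named in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation (their code is in the repository above). It makes several assumptions: the features are synthetic stand-ins for protein language model embeddings, the classifier stack uses two k-Nearest Neighbors base learners with different k values, scikit-learn's GradientBoostingRegressor replaces XGBoost so the example has no extra dependency, and all hyperparameters and the function names `demo_classification` and `demo_regression` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestClassifier,
    StackingClassifier,
    StackingRegressor,
)
from sklearn.metrics import (
    f1_score,
    matthews_corrcoef,
    mean_squared_error,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def demo_classification(seed=0):
    """Classifier stack: kNN base learners, Random Forest meta learner."""
    # Synthetic features stand in for embedding vectors (assumption).
    X, y = make_classification(
        n_samples=300, n_features=32, n_informative=8, random_state=seed
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    stack = StackingClassifier(
        estimators=[
            ("knn3", KNeighborsClassifier(n_neighbors=3)),  # hypothetical k
            ("knn7", KNeighborsClassifier(n_neighbors=7)),  # hypothetical k
        ],
        final_estimator=RandomForestClassifier(n_estimators=100, random_state=seed),
        cv=5,  # base-learner predictions for the meta learner come from 5-fold CV
    )
    pred = stack.fit(X_tr, y_tr).predict(X_te)
    return {
        "sensitivity": recall_score(y_te, pred),  # recall of the positive class
        "f1": f1_score(y_te, pred),
        "mcc": matthews_corrcoef(y_te, pred),
    }


def demo_regression(seed=0):
    """Regressor stack: tree and boosting base learners, SVR meta learner."""
    X, y = make_regression(
        n_samples=300, n_features=32, n_informative=8, noise=10.0, random_state=seed
    )
    y = (y - y.mean()) / y.std()  # standardize targets so the default SVR is usable
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    stack = StackingRegressor(
        estimators=[
            ("tree", DecisionTreeRegressor(max_depth=5, random_state=seed)),
            # Stand-in for an XGBoost regressor (assumption, see lead-in).
            ("gbr", GradientBoostingRegressor(random_state=seed)),
        ],
        final_estimator=SVR(),
        cv=5,
    )
    pred = stack.fit(X_tr, y_tr).predict(X_te)
    mse = float(mean_squared_error(y_te, pred))
    return {"mse": mse, "rmse": float(np.sqrt(mse))}  # RMSE is the square root of MSE


if __name__ == "__main__":
    print(demo_classification())
    print(demo_regression())
```

The regression metrics are linked by RMSE = sqrt(MSE), which matches the reported values: the square root of 0.158 is roughly 0.397, in line with the reported RMSE of 0.398 once rounding of the MSE is taken into account.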