deepNEC: a novel alignment-free tool for the identification and classification of nitrogen biochemical network-related enzymes using deep learning.

Affiliations

Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, UT 84322 USA.

Bioinformatics Facility, Center for Integrated BioSystems, UT 84322 USA.

Publication information

Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac071.

Abstract

Nitrogen is essential for life, and its transformations are an important part of the global biogeochemical cycle. Nitrogen exists in a range of oxidation states, from +5 (nitrate) to -3 (ammonium and amino-nitrogen), and the oxidation and reduction reactions catalyzed by microbial enzymes determine its environmental fate. Functional annotation of the genes encoding the core nitrogen network enzymes has a broad range of applications in metagenomics, agriculture, wastewater treatment and industrial biotechnology. This study developed an alignment-free computational approach that predicts nitrogen biochemical network-related enzymes directly from the sequence itself. We propose deepNEC, a novel end-to-end feature selection and classification model training approach for nitrogen biochemical network-related enzyme prediction. The algorithm was developed using deep learning, a class of machine learning algorithms that uses multiple layers to extract higher-level features from raw input data. The derived protein sequence is used as input, and sequential and convolutional features are extracted from the raw encoded sequence, so that enzyme prediction is based on classification rather than on traditional alignment-based methods. Two large datasets of protein sequences, enzymes and non-enzymes, were used to train the models with protein sequence features such as amino acid composition, dipeptide composition (DPC), conformation transition and distribution, normalized Moreau-Broto (NMBroto), conjoint and quasi order, etc. k-fold cross-validation and independent testing were performed to validate model training.
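As an illustration of one of the sequence features named above, the following is a minimal sketch of dipeptide composition (DPC): the relative frequency of each of the 400 ordered pairs of the 20 standard amino acids among adjacent residue pairs. The function name `dipeptide_composition`, the alphabetical ordering of the feature vector, and the choice to skip pairs containing non-standard residues are assumptions made for this example, not details taken from deepNEC.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in a fixed order that defines the feature vector.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional DPC vector for a protein sequence.

    Assumption for this sketch: adjacent pairs containing a non-standard
    residue (e.g. X, B, Z) are skipped and do not count toward the total.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 over the sequence.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    vec = dipeptide_composition("ACAC")
    # "ACAC" contains the pairs AC, CA, AC.
    print(len(vec), vec[DIPEPTIDES.index("AC")], vec[DIPEPTIDES.index("CA")])
```

A hybrid feature such as DPC + NMBroto would presumably concatenate this vector with the autocorrelation descriptors; NMBroto needs physicochemical property tables and is not sketched here.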
deepNEC uses a four-tier approach for prediction: in the first phase, it predicts whether a query sequence is an enzyme or a non-enzyme; in the second phase, it classifies predicted enzymes as nitrogen biochemical network-related enzymes or non-nitrogen metabolism enzymes; in the third phase, it classifies the predicted enzymes into nine nitrogen metabolism classes; and in the fourth phase, it predicts the enzyme commission number out of 20 classes for nitrogen metabolism. Among all features, the DPC + NMBroto hybrid feature gave the best prediction performance in phase I (accuracy of 96.15% in k-fold training and 93.43% in independent testing, with a Matthews correlation coefficient of 0.92 in training and 0.87 in independent testing). In phase II (accuracy of 99.71% in k-fold training and 98.30% in independent testing), phase III (overall accuracy of 99.03% in k-fold training and 98.98% in independent testing) and phase IV (overall accuracy of 99.05% in k-fold training and 98.18% in independent testing), the DPC feature gave the best prediction performance. We have also implemented a homology-based method to remove false negatives. All the models have been implemented on a web server (prediction tool), which is freely available at http://bioinfo.usu.edu/deepNEC/.
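The four-phase flow described above can be sketched as a simple cascade in which each phase runs only if the previous one returned a positive result. This is an illustration under stated assumptions, not the deepNEC implementation: the `Prediction` record, the `cascade_predict` function and the four classifier callables are hypothetical, and the abstract does not say whether phase IV depends on the output of phase III, so the sketch applies both to the same feature vector independently.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Features = Sequence[float]


@dataclass
class Prediction:
    """Hypothetical result record for the four-phase cascade."""
    is_enzyme: bool
    is_nitrogen_enzyme: Optional[bool] = None
    nitrogen_class: Optional[int] = None  # one of 9 classes (phase III)
    ec_class: Optional[int] = None        # one of 20 classes (phase IV)


def cascade_predict(
    features: Features,
    phase1: Callable[[Features], bool],  # enzyme vs. non-enzyme
    phase2: Callable[[Features], bool],  # nitrogen network vs. other enzyme
    phase3: Callable[[Features], int],   # index of nitrogen metabolism class
    phase4: Callable[[Features], int],   # index of enzyme commission class
) -> Prediction:
    # Phase I: stop early if the query is predicted to be a non-enzyme.
    if not phase1(features):
        return Prediction(is_enzyme=False)
    # Phase II: stop if the enzyme is not nitrogen network-related.
    if not phase2(features):
        return Prediction(is_enzyme=True, is_nitrogen_enzyme=False)
    # Phases III and IV: assumed here to run independently on the same input.
    return Prediction(
        is_enzyme=True,
        is_nitrogen_enzyme=True,
        nitrogen_class=phase3(features),
        ec_class=phase4(features),
    )


if __name__ == "__main__":
    # Stub classifiers stand in for trained models.
    result = cascade_predict(
        [0.1, 0.9],
        phase1=lambda f: True,
        phase2=lambda f: True,
        phase3=lambda f: 3,
        phase4=lambda f: 12,
    )
    print(result)
```

In practice each callable would wrap a trained model, for example one that thresholds a predicted probability for phases I and II and takes the highest-scoring class for phases III and IV.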
