A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.

Authors

Ni Qianwu, Chen Lei

Affiliation

College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.

Publication information

Comb Chem High Throughput Screen. 2017;20(7):612-621. doi: 10.2174/1386207320666170314103147.

DOI:10.2174/1386207320666170314103147

PMID:28292249

Abstract

AIM AND OBJECTIVE

Correct prediction of protein structural class benefits the investigation of protein functions, regulation, and interactions. In recent years, several computational methods have been proposed for this purpose. However, given the variety of available features, selecting a proper classification algorithm and extracting the essential features for classification remains a great challenge.

MATERIAL AND METHODS

In this study, a feature and algorithm selection method was presented for improving the accuracy of protein structural class prediction. Amino acid compositions and physicochemical features were adopted to represent proteins, and thirty-eight machine learning algorithms collected in Weka were employed. All features were first analyzed by a feature selection method, minimum redundancy maximum relevance (mRMR), producing a feature list. Then, several feature sets were constructed by adding features from the list one by one. For each feature set, the thirty-eight algorithms were executed on a dataset in which proteins were represented by the features in the set. The classes predicted by these algorithms and the true class of each protein were collected to construct a new dataset, which was analyzed by the mRMR method, yielding an algorithm list. Algorithms were then taken from this list one by one to build ensemble prediction models. Finally, the ensemble prediction model with the best performance was selected as the optimal ensemble prediction model.
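The algorithm-selection stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the mutual-information form of the mRMR score (relevance minus mean redundancy), and the use of majority voting to combine algorithms are all assumptions made for illustration, since the abstract does not specify them. The same greedy ranking could in principle be applied to discretized features in the feature-selection stage.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the frequency of each standard amino acid in a sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_rank(columns, target):
    """Greedy mRMR ranking (assumed form): pick the column with the highest
    relevance to the target minus mean redundancy with columns already picked."""
    remaining = list(range(len(columns)))
    ranked = []
    while remaining:
        def score(i):
            relevance = mutual_information(columns[i], target)
            if not ranked:
                return relevance
            redundancy = sum(mutual_information(columns[i], columns[j])
                             for j in ranked) / len(ranked)
            return relevance - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def majority_vote(votes):
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(votes).most_common(1)[0][0]


def best_ensemble(predictions, truth):
    """Add algorithms in mRMR order and keep the most accurate prefix."""
    order = mrmr_rank(predictions, truth)
    best_members, best_accuracy = [], -1.0
    for k in range(1, len(order) + 1):
        members = order[:k]
        voted = [majority_vote([predictions[m][i] for m in members])
                 for i in range(len(truth))]
        accuracy = sum(v == t for v, t in zip(voted, truth)) / len(truth)
        if accuracy > best_accuracy:
            best_members, best_accuracy = members, accuracy
    return best_members, best_accuracy


# Hypothetical toy data: three algorithms, each wrong on a different protein.
truth = list("aabbcc")
predictions = [list("aabbca"), list("abbbcc"), list("aabacc")]
members, accuracy = best_ensemble(predictions, truth)
print(sorted(members), accuracy)
```

In this toy case each algorithm errs on a different protein, so the vote over all three corrects every individual error; with real data the best prefix of the algorithm list is usually shorter than the full list, which is the point of ranking the algorithms before combining them.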

RESULTS

Experimental results indicate that the constructed model is far superior to models using a single algorithm and to other models that adopt only the feature selection procedure or only the algorithm selection procedure.

CONCLUSION

Both the feature selection procedure and the algorithm selection procedure are helpful for building an ensemble prediction model that yields better performance.

Similar articles

A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.

Comb Chem High Throughput Screen. 2017;20(7):612-621. doi: 10.2174/1386207320666170314103147.

Analysis and Prediction of Myristoylation Sites Using the mRMR Method, the IFS Method and an Extreme Learning Machine Algorithm.

Comb Chem High Throughput Screen. 2017;20(2):96-106. doi: 10.2174/1386207319666161220114424.

Prediction of Nitrated Tyrosine Residues in Protein Sequences by Extreme Learning Machine and Feature Selection Methods.

Comb Chem High Throughput Screen. 2018;21(6):393-402. doi: 10.2174/1386207321666180531091619.

Multiple classifier integration for the prediction of protein structural classes.

J Comput Chem. 2009 Nov 15;30(14):2248-54. doi: 10.1002/jcc.21230.

Protein-protein interface hot spots prediction based on a hybrid feature selection strategy.

BMC Bioinformatics. 2018 Jan 15;19(1):14. doi: 10.1186/s12859-018-2009-5.

Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences.

Comput Math Methods Med. 2021 May 7;2021:5529389. doi: 10.1155/2021/5529389. eCollection 2021.

Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou's PseAAC.

Med Biol Eng Comput. 2015 Apr;53(4):331-44. doi: 10.1007/s11517-014-1238-7. Epub 2015 Jan 7.

Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection.

Biomed Res Int. 2015;2015:425810. doi: 10.1155/2015/425810. Epub 2015 Oct 12.

Prediction of protein-protein interactions based on feature selection and data balancing.

Protein Pept Lett. 2013 Mar;20(3):336-45. doi: 10.2174/0929866511320030012.

Minimalist ensemble algorithms for genome-wide protein localization prediction.

BMC Bioinformatics. 2012 Jul 3;13:157. doi: 10.1186/1471-2105-13-157.

Cited by

PreAcrs: a machine learning framework for identifying anti-CRISPR proteins.

BMC Bioinformatics. 2022 Oct 25;23(1):444. doi: 10.1186/s12859-022-04986-3.

Roles of Physicochemical and Structural Properties of RNA-Binding Proteins in Predicting the Activities of Trans-Acting Splicing Factors with Machine Learning.

Int J Mol Sci. 2022 Apr 17;23(8):4426. doi: 10.3390/ijms23084426.

Analysis of Protein-Protein Functional Associations by Using Gene Ontology and KEGG Pathway.

Biomed Res Int. 2019 Jul 18;2019:4963289. doi: 10.1155/2019/4963289. eCollection 2019.

A Computational Method for Classifying Different Human Tissues with Quantitatively Tissue-Specific Expressed Genes.

Genes (Basel). 2018 Sep 7;9(9):449. doi: 10.3390/genes9090449.

Identification of the copy number variant biomarkers for breast cancer subtypes.

Mol Genet Genomics. 2019 Feb;294(1):95-110. doi: 10.1007/s00438-018-1488-4. Epub 2018 Sep 10.

Computational Approach to Investigating Key GO Terms and KEGG Pathways Associated with CNV.

Biomed Res Int. 2018 Apr 11;2018:8406857. doi: 10.1155/2018/8406857. eCollection 2018.

Identification of Differentially Expressed Genes between Original Breast Cancer and Xenograft Using Machine Learning Algorithms.

Genes (Basel). 2018 Mar 12;9(3):155. doi: 10.3390/genes9030155.

Identifying and analyzing different cancer subtypes using RNA-seq data of blood platelets.

Oncotarget. 2017 Sep 15;8(50):87494-87511. doi: 10.18632/oncotarget.20903. eCollection 2017 Oct 20.

Discriminating cirRNAs from other lncRNAs using a hierarchical extreme learning machine (H-ELM) algorithm with feature selection.

Mol Genet Genomics. 2018 Feb;293(1):137-149. doi: 10.1007/s00438-017-1372-7. Epub 2017 Sep 14.

Prediction and analysis of essential genes using the enrichments of gene ontology and KEGG pathways.

PLoS One. 2017 Sep 5;12(9):e0184129. doi: 10.1371/journal.pone.0184129. eCollection 2017.
