Using genetic algorithms to select most predictive protein features.

Author information

Kernytsky Andrew, Rost Burkhard

Affiliation

Department of Biochemistry and Molecular Biophysics, Columbia University, New York 10032, New York, USA.

Publication information

Proteins. 2009 Apr;75(1):75-88. doi: 10.1002/prot.22211.

DOI:10.1002/prot.22211

PMID:18798568

Abstract

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine-learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine-learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the coexistence of residue-based features. It is this combination of individual features (that is, the step from the fractions of serine and of buried residues, input space 20 + 2, to the fraction of buried serine, input space 20 * 2) that implicitly shifts the search space from global feature inputs to features that can capture very local evidence, such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm (GA) paired with neural networks and support vector machines. We find that the GA is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to the prediction of protein enzymatic activity.
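The step from a 20 + 2 input space to a 20 * 2 input space, and the GA search over the resulting features, can be illustrated with a minimal sketch. This is not the authors' implementation: the burial labels (`b`, `e`), the function names, the GA operators (truncation selection, one-point crossover, bit-flip mutation) and the toy fitness function are all assumptions made for illustration. In the paper the fitness of a feature subset would come from a neural network or support vector machine trained on it; here a stand-in function rewards two arbitrarily chosen features and penalizes large subsets.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "be"  # hypothetical labels: b = buried, e = exposed


def global_features(seq, burial):
    """20 + 2 features: fraction of each residue type and of each burial state."""
    n = len(seq)
    feats = {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    feats.update({s: burial.count(s) / n for s in STATES})
    return feats


def combined_features(seq, burial):
    """20 * 2 features: fraction of positions holding a given (residue, state) pair."""
    n = len(seq)
    pairs = list(zip(seq, burial))
    return {aa + s: pairs.count((aa, s)) / n
            for aa, s in product(AMINO_ACIDS, STATES)}


def select_features(n_features, fitness, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Toy GA over boolean feature masks; returns the fittest mask found."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; the best masks survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


INFORMATIVE = {3, 7}  # hypothetical indices of the "useful" features


def toy_fitness(mask):
    """Stand-in for classifier performance: reward useful features, penalize size."""
    return sum(1 for i in INFORMATIVE if mask[i]) - 0.1 * sum(mask)


if __name__ == "__main__":
    combined = combined_features("SSAG", "bebe")
    print(combined["Sb"])  # fraction of buried serine in the toy sequence
    best = select_features(len(combined), toy_fitness)
    print(sum(best), "features selected")
```

The size penalty in `toy_fitness` mirrors the trade-off described in the abstract: very small subsets are too general to score well, while very large ones are discouraged as a crude proxy for overtraining.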

Similar articles

Prediction of protein binding sites in protein structures using hidden Markov support vector machine.

BMC Bioinformatics. 2009 Nov 20;10:381. doi: 10.1186/1471-2105-10-381.

Prediction of protein subcellular localization.

Proteins. 2006 Aug 15;64(3):643-51. doi: 10.1002/prot.21018.

An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction.

IEEE/ACM Trans Comput Biol Bioinform. 2012 Sep-Oct;9(5):1387-98. doi: 10.1109/TCBB.2012.53.

Identification of catalytic residues from protein structure using support vector machine with sequence and structural features.

Biochem Biophys Res Commun. 2008 Mar 14;367(3):630-4. doi: 10.1016/j.bbrc.2008.01.038. Epub 2008 Jan 17.

Improving prediction accuracy of tumor classification by reusing genes discarded during gene selection.

BMC Genomics. 2008;9 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2164-9-S1-S3.

Identify catalytic triads of serine hydrolases by support vector machines.

J Theor Biol. 2004 Jun 21;228(4):551-7. doi: 10.1016/j.jtbi.2004.02.019.

Boosting phosphorylation site prediction with sequence feature-based machine learning.

Proteins. 2020 Feb;88(2):284-291. doi: 10.1002/prot.25801. Epub 2019 Aug 22.

Protein Contact Map Prediction Based on ResNet and DenseNet.

Biomed Res Int. 2020 Apr 6;2020:7584968. doi: 10.1155/2020/7584968. eCollection 2020.

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.

PLoS Comput Biol. 2017 Jan 5;13(1):e1005324. doi: 10.1371/journal.pcbi.1005324. eCollection 2017 Jan.

Cited by

Contrastive learning on protein embeddings enlightens midnight zone.

NAR Genom Bioinform. 2022 Jun 11;4(2):lqac043. doi: 10.1093/nargab/lqac043. eCollection 2022 Jun.

Encodings and models for antimicrobial peptide classification for multi-resistant pathogens.

BioData Min. 2019 Mar 4;12:7. doi: 10.1186/s13040-019-0196-x. eCollection 2019.

Effective automated feature construction and selection for classification of biological sequences.

PLoS One. 2014 Jul 17;9(7):e99982. doi: 10.1371/journal.pone.0099982. eCollection 2014.

Automatic quantitative MRI texture analysis in small-for-gestational-age fetuses discriminates abnormal neonatal neurobehavior.

PLoS One. 2013 Jul 26;8(7):e69595. doi: 10.1371/journal.pone.0069595. Print 2013.

Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers.

BioData Min. 2011 Nov 14;4:26. doi: 10.1186/1756-0381-4-26.

Insights into the classification of small GTPases.

Adv Appl Bioinform Chem. 2010;3:15-24. doi: 10.2147/aabc.s8891. Epub 2010 May 21.

Machine learning on normalized protein sequences.

BMC Res Notes. 2011 Mar 31;4:94. doi: 10.1186/1756-0500-4-94.

Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vectors machines (GA-SVM).

Mol Divers. 2011 Feb;15(1):269-89. doi: 10.1007/s11030-010-9234-9. Epub 2010 Mar 20.

Predicting Bevirimat resistance of HIV-1 from genotype.

BMC Bioinformatics. 2010 Jan 20;11:37. doi: 10.1186/1471-2105-11-37.
