Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy.

Authors

Jiang Rui, Yang Hua, Sun Fengzhu, Chen Ting

Affiliation

Molecular and Computational Biology, University of Southern California, MCB201, 1050 Childs Way, Los Angeles, CA 90089-2910, USA.

Publication information

BMC Bioinformatics. 2006 Sep 19;7:417. doi: 10.1186/1471-2105-7-417.

DOI:10.1186/1471-2105-7-417

PMID:16984653

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1618409/

Abstract

BACKGROUND

Understanding how amino acid substitutions affect protein function is critical for the study of proteins and their roles in disease. Although methods have been developed for predicting the potential effects of amino acid substitutions from the sequence, three-dimensional structural, and evolutionary properties of proteins, their application is limited by the complexity of the features and by the availability of protein structural information. A further limitation is that the prediction results are hard to interpret in terms of physicochemical principles and biological knowledge.

RESULTS

To overcome these limitations, we proposed a novel feature set built from physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with this feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions occurring in a wide range of human proteins. The results showed that the proposed feature set was superior to the existing ones. To explore the physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules that either were consistent with current biological knowledge or provided new insights into amino acid substitutions. When applied to unclassified data, these rules covered a large portion of the samples, and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest.
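As a rough illustration of what a simulated annealing bump hunting search can look like, the sketch below anneals over axis-aligned boxes (one interval per feature) and scores each box by the fraction of positive samples it covers. The abstract does not state the objective function, the move set, or the cooling schedule, so every one of those choices here, along with the names `covers`, `score`, `propose`, and `anneal_bump` and all parameter defaults, is an assumption made for illustration and not the authors' implementation.

```python
import math
import random


def covers(box, x):
    """True if sample x lies inside every interval of the box."""
    return all(lo <= x[j] <= hi for j, (lo, hi) in box.items())


def score(box, X, y, min_support):
    """Fraction of positives inside the box; -1.0 if the box covers too few samples."""
    inside = [label for x, label in zip(X, y) if covers(box, x)]
    if len(inside) < min_support:
        return -1.0
    return sum(inside) / len(inside)


def propose(box, X, rng):
    """Move one bound of one randomly chosen feature to an observed value (assumed move set)."""
    j = rng.randrange(len(X[0]))
    value = rng.choice(X)[j]
    lo, hi = box[j]
    if rng.random() < 0.5:
        lo = value
    else:
        hi = value
    if lo > hi:
        lo, hi = hi, lo
    new_box = dict(box)
    new_box[j] = (lo, hi)
    return new_box


def anneal_bump(X, y, min_support=10, steps=2000, t0=0.5, cooling=0.995, seed=0):
    """Search for a high-purity box with geometric cooling (assumed schedule)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Start from the box that spans the whole data set.
    box = {j: (min(x[j] for x in X), max(x[j] for x in X)) for j in range(n_features)}
    current = score(box, X, y, min_support)
    best_box, best = box, current
    t = t0
    for _ in range(steps):
        cand = propose(box, X, rng)
        s = score(cand, X, y, min_support)
        # Metropolis rule: always accept improvements, sometimes accept worse boxes.
        if s >= current or rng.random() < math.exp((s - current) / t):
            box, current = cand, s
            if current > best:
                best_box, best = box, current
        t = max(t * cooling, 1e-6)
    return best_box, best


if __name__ == "__main__":
    gen = random.Random(1)
    X = [[gen.random(), gen.random()] for _ in range(200)]
    y = [1 if x[0] > 0.5 and x[1] < 0.3 else 0 for x in X]
    rule, purity = anneal_bump(X, y)
    print(rule, purity)
```

Each returned box reads directly as a rule of the form "feature j lies between lo and hi", which is the kind of interpretability a box-shaped search is meant to offer. The `min_support` floor is one simple way to trade coverage against purity; other objectives are equally plausible.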

CONCLUSION

The prediction methods using the proposed feature set achieve a larger AUC (area under the ROC curve), a smaller BER (balanced error rate), and a larger MCC (Matthews correlation coefficient) than those using the published feature sets, suggesting that our feature set is superior to the existing ones. The rules extracted by the simulated annealing bump hunting strategy have coverage and accuracy comparable to, but much better interpretability than, those extracted by the patient rule induction method (PRIM), indicating that the strategy is more effective at inducing interpretable rules.
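For readers unfamiliar with the three evaluation measures named above, the following minimal sketch computes them for binary labels using their standard textbook definitions. It assumes labels coded as 1 (positive) and 0 (negative); the function names and the toy data are illustrative and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_error_rate(y_true, y_pred):
    """Average of the error rates on the positive and the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.7]
    print(balanced_error_rate(y_true, y_pred))
    print(matthews_corrcoef(y_true, y_pred))
    print(auc(y_true, scores))
```

BER and MCC both account for class imbalance, which matters when deleterious and neutral substitutions are unequally represented; AUC is threshold-free and so complements the two threshold-dependent measures.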
