Logic minimization and rule extraction for identification of functional sites in molecular sequences.

Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD, USA.

BioData Min. 2012 Aug 16;5(1):10. doi: 10.1186/1756-0381-5-10.

BACKGROUND

Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes and glycosylation is a posttranslational modification critical to protein functions.
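To make the idea of reducing a binary dataset to fewer rules concrete, here is a minimal sketch of the merging step used in Quine-McCluskey-style minimization. It is an illustration only, not the algorithm used in the paper: the function names are hypothetical, and it stops at prime implicants without selecting a minimal cover.

```python
from itertools import combinations


def merge(a, b):
    """Merge two implicants that differ in exactly one fixed bit, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]  # '-' marks a variable that no longer matters
    return None


def prime_implicants(minterms):
    """Repeatedly merge implicants until no further merge is possible."""
    current = set(minterms)
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used  # implicants that could not be merged are prime
        current = merged
    return primes


# Four positive rows over three binary variables collapse to one rule:
# "first variable is 0, the other two are irrelevant".
rules = prime_implicants({"000", "001", "010", "011"})
```

Here the four positive examples are described by a single rule over one variable, which is the kind of reduction in variables and rules that the definition above refers to.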

METHODS

In the present study, we first transformed the original biological dataset into a suitable binary form. Logic minimization was then applied to generate sets of simple rules to describe the transformed dataset. These rules were used to predict TFBS and O-glycosylation sites. The TFBS dataset was obtained from the TRANSFAC database, while the glycosylation dataset was compiled using information from OGLYCBASE and the Swiss-Prot database. We performed the same predictions using two standard classification techniques, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and used their sensitivities and positive predictive values as benchmarks for the performance of our proposed algorithm. SVM were also used to reduce the number of variables included in the logic minimization approach.
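The pipeline described above (binary transformation, rule-based prediction, and evaluation by sensitivity and positive predictive value) can be sketched as follows. This is a hedged illustration: the abstract does not specify the binary encoding, so the 4-bit one-hot code per nucleotide, the rule format, and all function names are assumptions.

```python
# Hypothetical one-hot code: 4 bits per nucleotide (an assumption, not the paper's stated encoding).
ONE_HOT = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}


def encode(seq):
    """Concatenate the one-hot code of each base into one bit string."""
    return "".join(ONE_HOT[base] for base in seq.upper())


def matches(rule, bits):
    """A rule is a pattern over {0, 1, -}; '-' means the bit is ignored."""
    return len(rule) == len(bits) and all(r in ("-", b) for r, b in zip(rule, bits))


def predict(rules, bits):
    """Predict a positive site (1) if any rule fires, else negative (0)."""
    return int(any(matches(r, bits) for r in rules))


def sensitivity_ppv(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, ppv


# Illustrative rule: "the first base is A, the second base is irrelevant".
example_rules = ["1000----"]
example_prediction = predict(example_rules, encode("AC"))
```

Sensitivity and positive predictive value are used here, rather than overall accuracy, because they stay informative when positive examples are scarce relative to negatives.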

RESULTS

For both TFBS and O-glycosylation sites, the prediction performance of the proposed logic minimization method was generally comparable to, and in some cases superior to, that of the standard ANN and SVM classification methods, with the added advantage of providing intelligible rules to describe the datasets. In TFBS prediction, logic minimization produced a very small set of simple rules. In glycosylation site prediction, the rules produced were also interpretable, and the most popular rules generated appeared to correlate well with recently reported hydrophilic/hydrophobic enhancement values of amino acids around possible O-glycosylation sites. Experiments with Self-Organizing Neural Networks corroborate the practical worth of the logic minimization method for these case studies.

CONCLUSIONS

The proposed logic minimization algorithm provides sets of rules that can be used to predict TFBS and O-glycosylation sites with sensitivity and positive predictive value comparable to those from ANN and SVM. Moreover, the logic minimization method has the additional capability of generating interpretable rules that allow biological scientists to correlate the predictions with other experimental results and to form new hypotheses for further investigation. Additional experiments with alternative rule-extraction techniques demonstrate that the logic minimization method is able to produce accurate rules from datasets with large numbers of variables and limited numbers of positive examples.

Similar articles

Logic minimization and rule extraction for identification of functional sites in molecular sequences.

BioData Min. 2012 Aug 16;5(1):10. doi: 10.1186/1756-0381-5-10.

Regulatory motif finding by logic regression.

Bioinformatics. 2004 Nov 1;20(16):2799-811. doi: 10.1093/bioinformatics/bth333. Epub 2004 May 27.

Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs.

BMC Bioinformatics. 2008 Feb 18;9:101. doi: 10.1186/1471-2105-9-101.

Toward better understanding of protein secondary structure: extracting prediction rules.

IEEE/ACM Trans Comput Biol Bioinform. 2011 May-Jun;8(3):858-64. doi: 10.1109/TCBB.2010.16.

Prediction of different types of liver diseases using rule based classification model.

Technol Health Care. 2013;21(5):417-32. doi: 10.3233/THC-130742.

[Rule induction algorithm for brain glioma using support vector machine].

Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2006 Apr;23(2):410-2.

Extraction of the association rules from artificial neural networks based on the multiobjective optimization.

Network. 2022 Aug-Nov;33(3-4):233-252. doi: 10.1080/0954898X.2022.2137258. Epub 2022 Oct 19.

Accurate prediction of major histocompatibility complex class II epitopes by sparse representation via ℓ 1-minimization.

BioData Min. 2014 Nov 4;7:23. doi: 10.1186/1756-0381-7-23. eCollection 2014.

Development of river ecosystem models for Flemish watercourses: case studies in the Zwalm river basin.

Meded Rijksuniv Gent Fak Landbouwkd Toegep Biol Wet. 2001;66(1):71-86.

A comparison between two neural network rule extraction techniques for the diagnosis of hepatobiliary disorders.

Artif Intell Med. 2000 Nov;20(3):205-16. doi: 10.1016/s0933-3657(00)00064-6.

Cited by

Prediction of O-glycosylation Sites Using Random Forest and GA-Tuned PSO Technique.

Bioinform Biol Insights. 2015 Jul 5;9:103-9. doi: 10.4137/BBI.S26864. eCollection 2015.

References

Predictions of hot spot residues at protein-protein interfaces using support vector machines.

PLoS One. 2011 Feb 28;6(2):e16774. doi: 10.1371/journal.pone.0016774.

Emerging paradigms for the initiation of mucin-type protein O-glycosylation by the polypeptide GalNAc transferase family of glycosyltransferases.

J Biol Chem. 2011 Apr 22;286(16):14493-507. doi: 10.1074/jbc.M111.218701. Epub 2011 Feb 24.

Development of robust calibration models using support vector machines for spectroscopic monitoring of blood glucose.

Anal Chem. 2010 Dec 1;82(23):9719-26. doi: 10.1021/ac101754n. Epub 2010 Nov 4.

Probabilistic peak calling and controlling false discovery rate estimations in transcription factor binding site mapping from ChIP-seq.

Methods Mol Biol. 2010;674:161-77. doi: 10.1007/978-1-60761-854-6_10.

The Motif Tool Assessment Platform (MTAP) for sequence-based transcription factor binding site prediction tools.

Methods Mol Biol. 2010;674:121-41. doi: 10.1007/978-1-60761-854-6_8.

Probabilistic approaches to transcription factor binding site prediction.

Methods Mol Biol. 2010;674:97-119. doi: 10.1007/978-1-60761-854-6_7.

Least-Squares Support Vector Machine Approach to Viral Replication Origin Prediction.

INFORMS J Comput. 2010 Jun 1;22(3):457-470. doi: 10.1287/ijoc.1090.0360.

Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction.

Nucleic Acids Res. 2010 Jul;38(12):e135. doi: 10.1093/nar/gkq274. Epub 2010 May 3.

The early history of the Sox genes.

Int J Biochem Cell Biol. 2010 Mar;42(3):378-80. doi: 10.1016/j.biocel.2009.12.003. Epub 2009 Dec 31.

Mucin-type O-glycosylation--putting the pieces together.

FEBS J. 2010 Jan;277(1):81-94. doi: 10.1111/j.1742-4658.2009.07429.x. Epub 2009 Nov 17.
