Logic minimization and rule extraction for identification of functional sites in molecular sequences.

Affiliation

Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD, USA.

Publication information

BioData Min. 2012 Aug 16;5(1):10. doi: 10.1186/1756-0381-5-10.

Abstract

BACKGROUND

Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes and glycosylation is a posttranslational modification critical to protein functions.
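As a rough illustration of what "reducing the number of rules" means, the sketch below merges binary patterns that differ in a single bit, which is the prime-implicant step of the classic Quine-McCluskey procedure. It is an assumption chosen for illustration only: the abstract does not say which minimization algorithm the authors used, and the function names `merge` and `prime_implicants` are hypothetical.

```python
from itertools import combinations


def merge(a, b):
    """Merge two implicants that differ in exactly one fixed bit, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        # "-" marks a variable that no longer matters for this rule.
        return a[:i] + "-" + a[i + 1:]
    return None


def prime_implicants(minterms):
    """Repeatedly merge implicants; return those that cannot be merged further."""
    current = set(minterms)
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used  # anything left unmerged is prime
        current = merged
    return primes


# Four positive patterns over three binary variables collapse to one rule:
# "the first variable is 1, the other two do not matter".
rules = prime_implicants({"100", "101", "110", "111"})
```

This covers only prime-implicant generation. A full minimizer would also select a minimal subset of the implicants that still covers every positive example.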

METHODS

In the present study, we first transformed the original biological dataset into a suitable binary form. Logic minimization was then applied to generate sets of simple rules to describe the transformed dataset. These rules were used to predict TFBS and O-glycosylation sites. The TFBS dataset was obtained from the TRANSFAC database, while the glycosylation dataset was compiled using information from OGLYCBASE and the Swiss-Prot database. We performed the same predictions using two standard classification techniques, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and used their sensitivities and positive predictive values as benchmarks for the performance of our proposed algorithm. SVM were also used to reduce the number of variables included in the logic minimization approach.
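The pipeline described above (binary encoding, rule-based prediction, then evaluation by sensitivity and positive predictive value) can be sketched as follows. The one-hot nucleotide encoding, the "-" wildcard rule format, and all names (`NUCLEOTIDE_BITS`, `encode`, `matches`, `predict`, `sensitivity_and_ppv`) are assumptions made for illustration, since the abstract does not specify the encoding or rule representation. Only the two metric definitions are standard: sensitivity is TP / (TP + FN) and positive predictive value is TP / (TP + FP).

```python
# Hypothetical one-hot encoding; the paper's actual binary transform is not given here.
NUCLEOTIDE_BITS = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}


def encode(sequence):
    """Turn a DNA string into a bit string, four bits per nucleotide."""
    return "".join(NUCLEOTIDE_BITS[base] for base in sequence.upper())


def matches(rule, bits):
    """A rule matches when every non-wildcard position equals the input bit."""
    return len(rule) == len(bits) and all(r in ("-", b) for r, b in zip(rule, bits))


def predict(rules, bits):
    """Predict a functional site if any rule in the set matches."""
    return any(matches(rule, bits) for rule in rules)


def sensitivity_and_ppv(predicted, actual):
    """Return (sensitivity, positive predictive value) for boolean label lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


# Toy usage with a made-up rule: "the first nucleotide is A".
toy_rules = ["1000----"]
toy_sites = ["AC", "AG", "TA", "CC"]
toy_labels = [True, False, True, False]  # invented labels, not real data
toy_predictions = [predict(toy_rules, encode(s)) for s in toy_sites]
toy_scores = sensitivity_and_ppv(toy_predictions, toy_labels)
```

Because each rule is a readable pattern over named positions, a rule set of this kind can be inspected directly, which is the interpretability advantage the abstract attributes to logic minimization over ANN and SVM.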

RESULTS

For both TFBS and O-glycosylation sites, the prediction performance of the proposed logic minimization method was generally comparable to and, in some cases, superior to that of the standard ANN and SVM classification methods, with the added advantage of providing intelligible rules to describe the datasets. In TFBS prediction, logic minimization produced a very small set of simple rules. In glycosylation site prediction, the rules produced were also interpretable, and the most frequently generated rules appeared to correlate well with recently reported hydrophilic/hydrophobic enhancement values of amino acids around possible O-glycosylation sites. Experiments with Self-Organizing Neural Networks corroborate the practical worth of the logic minimization method for these case studies.

CONCLUSIONS

The proposed logic minimization algorithm provides sets of rules that can be used to predict TFBS and O-glycosylation sites with sensitivity and positive predictive value comparable to those from ANN and SVM. Moreover, the logic minimization method has the additional capability of generating interpretable rules that allow biological scientists to correlate the predictions with other experimental results and to form new hypotheses for further investigation. Additional experiments with alternative rule-extraction techniques demonstrate that the logic minimization method is able to produce accurate rules from datasets with large numbers of variables and limited numbers of positive examples.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/289a/3492099/666bd252ec70/1756-0381-5-10-1.jpg