Shah I, Hunter L
American Type Culture Collection, Manassas, VA 20110, USA.
Pac Symp Biocomput. 2000:278-90. doi: 10.1142/9789814447331_0026.
This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme function prediction from sequence, and in many cases they succeeded. Our system provides graphical user interfaces for defining and exploring training datasets and alternative representations, for inspecting the hypotheses induced by different types of learning algorithms, for visualizing the global results, and for inspecting in detail the results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They helped considerably in understanding both the significance of our successes and failures and the biological explanations underlying them. Using these visualizations, we could efficiently identify weaknesses of the modular sequence representations and induction algorithms, which in turn suggested better learning strategies. The toolkit was developed in the context of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. To test the hypothesis that more detailed sequence analysis, using machine learning techniques and modular domain representations, could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among the various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
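As an illustration of what "information-theoretic decision tree induction" over a modular domain representation can look like, the following minimal Python sketch scores each domain by information gain, a splitting criterion commonly used in decision tree induction. It is an assumption that this particular measure matches the one used in the experiments; the abstract does not say. The domain names (`domA`, `domB`, `domC`), the class labels (`EC_x`, `EC_y`), and the four toy proteins are hypothetical and are not data from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(examples, labels, domain):
    """Entropy reduction from splitting on presence/absence of `domain`.

    `examples` is a list of sets; each set holds the domains found in one
    protein (a hypothetical stand-in for a modular domain representation).
    """
    with_domain = [lab for doms, lab in zip(examples, labels) if domain in doms]
    without_domain = [lab for doms, lab in zip(examples, labels) if domain not in doms]
    remainder = 0.0
    for subset in (with_domain, without_domain):
        if subset:  # skip empty partitions to avoid dividing by zero
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: four similar proteins, two function classes.
examples = [
    {"domA", "domB"},
    {"domA", "domB"},
    {"domA", "domC"},
    {"domA", "domC"},
]
labels = ["EC_x", "EC_x", "EC_y", "EC_y"]

gains = {d: information_gain(examples, labels, d) for d in ("domA", "domB", "domC")}
best_domain = max(gains, key=gains.get)
```

In this toy set, `domA` occurs in every protein and therefore cannot separate the two classes, whereas the presence of `domB` or `domC` separates them completely. This mirrors the situation described in the abstract, in which overall sequence similarity fails to distinguish functions but local domain features may.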