
Visual management of large scale data mining projects.

Authors

Shah I, Hunter L

Affiliation

American Type Culture Collection, Manassas, VA 20110, USA.

Publication

Pac Symp Biocomput. 2000:278-90. doi: 10.1142/9789814447331_0026.

DOI: 10.1142/9789814447331_0026
PMID: 10902176
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2709531/

Abstract

This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme functional predictions from sequence, and in many cases were successful. Our system provides graphical user interfaces for defining and exploring training datasets and various representational alternatives, for inspecting the hypotheses induced by various types of learning algorithms, for visualizing the global results, and for inspecting in detail results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They provided significant help in understanding both the significance and the underlying biological explanations of our successes and failures. Using these visualizations it was possible to efficiently identify weaknesses of the modular sequence representations and induction algorithms which suggest better learning strategies. The context in which our data mining visualization toolkit was developed was the problem of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. In order to test the hypothesis that more detailed sequence analysis using machine learning techniques and modular domain representations could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
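
As a rough illustration of the "information-theoretic decision tree induction" on modular representations mentioned in the abstract, the following minimal sketch scores each module by the information gain of splitting proteins on its presence or absence. It is not the authors' implementation: the module names (`mod_A`, `mod_B`, `mod_C`), the function labels, and the toy dataset are hypothetical, and a full decision tree learner would apply this split selection recursively.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: each protein is (set of modules present, function label).
EXAMPLES = [
    ({"mod_A", "mod_B"}, "function_1"),
    ({"mod_A", "mod_B"}, "function_1"),
    ({"mod_A", "mod_C"}, "function_2"),
    ({"mod_A", "mod_C"}, "function_2"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, module):
    """Entropy reduction from splitting on presence/absence of one module."""
    labels = [label for _, label in examples]
    present = [label for modules, label in examples if module in modules]
    absent = [label for modules, label in examples if module not in modules]
    # Weighted entropy of the non-empty partitions after the split.
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (present, absent)
        if part
    )
    return entropy(labels) - remainder


def best_module(examples):
    """Module with the highest information gain (ties broken alphabetically)."""
    modules = sorted({m for mods, _ in examples for m in mods})
    return max(modules, key=lambda m: information_gain(examples, m))


if __name__ == "__main__":
    for name in ("mod_A", "mod_B", "mod_C"):
        print(name, round(information_gain(EXAMPLES, name), 3))
    print("best split:", best_module(EXAMPLES))
```

In this toy set, `mod_A` occurs in every protein and so carries no information about function, whereas `mod_B` and `mod_C` each separate the two functions completely. This is the kind of situation the abstract has in mind when similar sequences with different functions are discriminated by their module content.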


Similar articles

1
Visual management of large scale data mining projects.
Pac Symp Biocomput. 2000:278-90. doi: 10.1142/9789814447331_0026.
2
Identification of divergent functions in homologous proteins by induction over conserved modules.
Proc Int Conf Intell Syst Mol Biol. 1998;6:157-64.
3
ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.
BMC Bioinformatics. 2008 Aug 12;9 Suppl 9(Suppl 9):S5. doi: 10.1186/1471-2105-9-S9-S5.
4
5
A system for exploring and visualizing biological pathways from large scale datasets.
Annu Int Conf IEEE Eng Med Biol Soc. 2008;2008:4086-9. doi: 10.1109/IEMBS.2008.4650107.
6
Suffix tree searcher: exploration of common substrings in large DNA sequence sets.
BMC Res Notes. 2014 Jul 23;7:466. doi: 10.1186/1756-0500-7-466.
7
Better prediction of protein cellular localization sites with the k nearest neighbors classifier.
Proc Int Conf Intell Syst Mol Biol. 1997;5:147-52.
8
Data mining in bioinformatics using Weka.
Bioinformatics. 2004 Oct 12;20(15):2479-81. doi: 10.1093/bioinformatics/bth261. Epub 2004 Apr 8.
9
BONSAI Garden: parallel knowledge discovery system for amino acid sequences.
Proc Int Conf Intell Syst Mol Biol. 1995;3:359-66.
10
Distributed data mining on grids: services, tools, and applications.
IEEE Trans Syst Man Cybern B Cybern. 2004 Dec;34(6):2451-65. doi: 10.1109/tsmcb.2004.836890.
