Lacroix Zoé, Raschid Louiqa, Eckman Barbara A
Arizona State University, PO Box 876106, Tempe, Arizona 85287-6106, USA.
J Bioinform Comput Biol. 2004 Jun;2(2):375-411. doi: 10.1142/s0219720004000648.
Today, scientific data are inevitably digitized, stored in a wide variety of formats, and accessible over the Internet. Scientific discovery increasingly involves accessing multiple heterogeneous data sources, integrating the results of complex queries, and applying further analysis and visualization applications in order to collect datasets of interest. Building a scientific integration platform to support these critical tasks requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, and data that are locally materialized in warehouses or generated by software. The inefficiency of existing approaches can significantly hamper this process, causing lengthy delays while accessing critical resources or causing the system to fail to report any results. Some queries take so long to answer that their results are returned via email, which makes integrating them with other results tedious. This paper presents several issues that need to be addressed to provide seamless and efficient integration of biomolecular data. Identified challenges include: capturing and representing the various domain-specific computational capabilities supported by a source, including sequence or text search engines and traditional query processing; developing a methodology to acquire and represent semantic knowledge and metadata about source contents, overlap in source contents, and access costs; and developing cost- and semantics-based decision support tools to select sources and capabilities and to generate efficient query evaluation plans.