Department of Zoology, University of Oxford, Oxford OX1 3PS, UK.
J Biomed Inform. 2010 Oct;43(5):752-61. doi: 10.1016/j.jbi.2010.04.004.
Integrating heterogeneous data across distributed sources is a major requirement for in silico bioinformatics supporting translational research. For example, genome-scale data on patterns of gene expression in the fruit fly Drosophila melanogaster are widely used in functional genomic studies in many organisms to inform candidate gene selection and validate experimental results. However, current data integration solutions tend to be heavyweight and require significant initial and ongoing investment of effort. Development of a common Web-based data integration infrastructure (a.k.a. data web), using Semantic Web standards, promises to alleviate these difficulties, but little is known about the feasibility, costs, risks or practical means of migrating to such an infrastructure.
We describe the development of OpenFlyData, a proof-of-concept system integrating gene expression data on D. melanogaster, combining Semantic Web standards with lightweight approaches to Web programming based on Web 2.0 design patterns. To support researchers designing and validating functional genomic studies, OpenFlyData includes user-facing search applications providing intuitive access to, and comparison of, gene expression data from FlyAtlas, the BDGP in situ database, and FlyTED, using data from FlyBase to expand and disambiguate gene names. OpenFlyData's services are also openly accessible and available for reuse by other bioinformaticians and application developers. Semi-automated methods and tools were developed to support labour- and knowledge-intensive tasks involved in deploying SPARQL services. These include methods for generating ontologies and relational-to-RDF mappings for relational databases, which we illustrate using the FlyBase Chado database schema, and methods for mapping gene identifiers between databases. The advantages of using Semantic Web standards for biomedical data integration are discussed, as are open issues. In particular, although the performance of open source SPARQL implementations is sufficient to query gene expression data directly from user-facing applications such as Web-based data fusions (a.k.a. mashups), we found open SPARQL endpoints to be vulnerable to denial-of-service-type problems, which must be mitigated to ensure reliability of services based on this standard. These results are relevant to data integration activities in translational bioinformatics.
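As a rough illustration of the two points above (querying a SPARQL endpoint directly from an application, and limiting the load an open endpoint accepts), the following Python sketch builds a gene-symbol lookup query, encodes it as a SPARQL protocol GET request, and applies a naive guard that accepts only SELECT queries and caps the LIMIT clause. The endpoint URL, the `ex:` vocabulary, the function names and the 500-row cap are all hypothetical and chosen for illustration only; they are not taken from OpenFlyData or SPARQLite, and the guard is a simplification rather than the mitigation the authors used.

```python
import re
from urllib.parse import urlencode

# Hypothetical endpoint and row cap, for illustration only.
ENDPOINT = "http://example.org/sparql"
MAX_ROWS = 500


def gene_lookup_query(symbol: str, limit: int = 10) -> str:
    """Build a SELECT query for genes with a given symbol (hypothetical vocabulary)."""
    # Escape backslashes and double quotes so the symbol stays inside the literal.
    escaped = symbol.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ex: <http://example.org/vocab#>\n"
        "SELECT ?gene WHERE {\n"
        f'  ?gene ex:symbol "{escaped}" .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(endpoint: str, query: str) -> str:
    """Encode the query as the `query` parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


def guard_query(query: str, max_rows: int = MAX_ROWS) -> str:
    """Naive server-side guard: accept only SELECT and cap the trailing LIMIT.

    Assumes LIMIT, if present, is the last clause of the query; a real
    service would parse the query instead of using regular expressions.
    """
    without_iris = re.sub(r"<[^>]*>", "", query)  # ignore keywords inside IRIs
    form = re.search(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", without_iris, re.IGNORECASE)
    if form is None or form.group(1).upper() != "SELECT":
        raise ValueError("only SELECT queries are accepted")
    match = re.search(r"\bLIMIT\s+(\d+)\s*$", query, re.IGNORECASE)
    if match is None:
        return query.rstrip() + f"\nLIMIT {max_rows}"
    capped = min(int(match.group(1)), max_rows)
    return query[: match.start()] + f"LIMIT {capped}"


if __name__ == "__main__":
    safe = guard_query(gene_lookup_query("Adh", limit=100000))
    print(safe)
    print(request_url(ENDPOINT, safe))
```

A cap on result size addresses only one source of load; query timeouts and request rate limits would be separate measures and are not shown here.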
The gene expression search applications and SPARQL endpoints developed for OpenFlyData are deployed at http://openflydata.org. FlyUI, a library of JavaScript widgets providing re-usable user-interface components for Drosophila gene expression data, is available at http://flyui.googlecode.com. Software and ontologies to support transformation of data from FlyBase, FlyAtlas, BDGP and FlyTED to RDF are available at http://openflydata.googlecode.com. SPARQLite, an implementation of the SPARQL protocol, is available at http://sparqlite.googlecode.com. All software is provided under the GPL version 3 open source license.