Department of Bioinformatics and Computational Biology, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Unit 1410, Houston, TX 77230-1402, USA.
J Biomed Inform. 2010 Dec;43(6):998-1008. doi: 10.1016/j.jbi.2010.09.004.
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA are particularly challenging for those interested in dynamically aggregating their results, because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a way to integrate and discover data with those characteristics, namely by exposing data as Web services that support SPARQL, the query language for the Resource Description Framework (RDF). Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, the data must first be converted to RDF triples. To address those requirements, we have developed an infrastructure that exposes clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution, and one that greatly facilitates the assembly of SPARQL queries, is the distinction between the TCGA domain descriptors and the data elements. Furthermore, using the S3DB management model as a mediator enables queries over both public and protected data without the need for prior submission to a single data source.
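As a rough illustration of what assembling a SPARQL query from domain descriptors can look like, the following Python sketch builds a SELECT query from a class name and a list of property names. It is not the interface described in the abstract: the namespace `http://example.org/tcga#`, the function `build_select_query`, and the descriptor names `Patient`, `gender` and `tumorType` are all hypothetical and chosen only for the example.

```python
from typing import List, Optional

# Hypothetical namespace for illustration; not the actual TCGA or S3DB URI scheme.
EXAMPLE_NS = "http://example.org/tcga#"


def build_select_query(entity_class: str,
                       properties: List[str],
                       limit: Optional[int] = None) -> str:
    """Assemble a SPARQL SELECT query for one class and some of its properties.

    The class and property names play the role of domain descriptors; the
    values bound to the variables at query time are the data elements.
    """
    # Only accept simple identifiers so names cannot inject query syntax.
    for name in [entity_class] + properties:
        if not name.isidentifier():
            raise ValueError(f"invalid descriptor name: {name!r}")

    variables = ["?item"] + [f"?{name}" for name in properties]
    lines = [
        f"PREFIX ex: <{EXAMPLE_NS}>",
        f"SELECT {' '.join(variables)}",
        "WHERE {",
        f"  ?item a ex:{entity_class} .",
    ]
    # One triple pattern per requested property.
    for name in properties:
        lines.append(f"  ?item ex:{name} ?{name} .")
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Hypothetical descriptors, used only to show the shape of the output.
query = build_select_query("Patient", ["gender", "tumorType"], limit=10)
print(query)
```

The resulting string could then be sent to any SPARQL endpoint. Keeping descriptors (the class and property names) separate from data elements (the values returned) is what lets a query be assembled by navigation alone, without knowing the data in advance.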