Saleem Muhammad, Padmanabhuni Shanmukha S, Ngomo Axel-Cyrille Ngonga, Iqbal Aftab, Almeida Jonas S, Decker Stefan, Deus Helena F
Universität Leipzig, IFI/AKSW, PO 100920, D-04009 Leipzig, Germany.
Insight Centre for Data Analytics, National University of Ireland (NUIG), Galway, Ireland.
J Biomed Semantics. 2014 Dec 3;5:47. doi: 10.1186/2041-1480-5-47. eCollection 2014.
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to achieve the virtual data integration needed to collect the critical covariates necessary for analysis.
We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints through TopFed, a federated SPARQL query processing engine tailored to TCGA.
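To illustrate the general idea of source selection in a federated setting, the following is a minimal sketch that assumes each endpoint advertises the set of predicates it can answer. It is not TopFed's actual algorithm; the endpoint URLs, the `tcga:` predicate names, and the functions `select_sources` and `select_for_query` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]

# Hypothetical index: endpoint URL -> predicates that endpoint holds data for.
# Neither the URLs nor the predicate names come from the TCGA deployment.
ENDPOINT_INDEX: Dict[str, Set[str]] = {
    "http://example.org/sparql/methylation": {"tcga:beta_value", "tcga:chromosome"},
    "http://example.org/sparql/expression": {"tcga:rpkm", "tcga:chromosome"},
    "http://example.org/sparql/clinical": {"tcga:gender", "tcga:vital_status"},
}


def select_sources(pattern: TriplePattern, index: Dict[str, Set[str]]) -> List[str]:
    """Return the endpoints that may contribute answers to one triple pattern."""
    _, predicate, _ = pattern
    if predicate.startswith("?"):
        # An unbound predicate cannot be pruned with this index: keep every endpoint.
        return sorted(index)
    return sorted(url for url, predicates in index.items() if predicate in predicates)


def select_for_query(
    patterns: List[TriplePattern], index: Dict[str, Set[str]]
) -> Dict[TriplePattern, List[str]]:
    """Run source selection independently for each triple pattern of a query."""
    return {pattern: select_sources(pattern, index) for pattern in patterns}


if __name__ == "__main__":
    query = [("?s", "tcga:rpkm", "?v"), ("?s", "tcga:chromosome", "?c")]
    for pattern, sources in select_for_query(query, ENDPOINT_INDEX).items():
        print(pattern, "->", sources)
```

A federation engine would then send each triple pattern only to its selected endpoints and join the partial results, so fewer selected sources generally means fewer remote requests.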
We compare TopFed with FedX, a well-established federation engine, in terms of source selection and query execution time, using 10 federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with a query execution time about one third of that of FedX.
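The two source-selection measures mentioned here, recall and the average number of sources selected, can be computed as in the sketch below. The function names are hypothetical, and the sketch assumes that for each query the set of endpoints that truly contribute results is known; it does not reproduce the paper's evaluation code or data.

```python
from typing import List, Set


def source_selection_recall(selected: Set[str], relevant: Set[str]) -> float:
    """Fraction of truly relevant endpoints that the engine selected."""
    if not relevant:
        # No endpoint is relevant, so nothing could have been missed.
        return 1.0
    return len(selected & relevant) / len(relevant)


def mean_sources_selected(selections: List[Set[str]]) -> float:
    """Average number of endpoints selected per query."""
    if not selections:
        raise ValueError("at least one query is required")
    return sum(len(selected) for selected in selections) / len(selections)


if __name__ == "__main__":
    # Made-up endpoint labels for illustration only, not measured results.
    selected = {"ep1", "ep2", "ep3"}
    relevant = {"ep1", "ep2"}
    print(source_selection_recall(selected, relevant))
    print(mean_sources_selected([selected, {"ep1"}]))
```

A recall of 100% means no contributing endpoint was skipped, so selecting fewer sources at that recall reduces work without losing results.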
With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be queried in unison. We believe the proposed system can greatly help researchers in the biomedical domain carry out their research with TCGA effectively, because the amount and diversity of the data exceed the ability of local resources to handle its retrieval and parsing.