Muilu Juha, Peltonen Leena, Litton Jan-Eric
Finnish Genome Center, University of Helsinki, Helsinki, Finland.
Eur J Hum Genet. 2007 Jul;15(7):718-23. doi: 10.1038/sj.ejhg.5201850. Epub 2007 May 9.
Integrating and managing complex data are major challenges in large-scale, biobank-based research projects of the post-genome era. One such project is GenomEUtwin, an international collaboration between eight Twin Registries that combines extensive genotype and phenotype data from sources located in different countries. The challenge lies not only in harmonizing data and continually updating clinical details at the various sites, but also in the heterogeneity of data storage and the confidentiality of sensitive health-related and genetic data. A solid infrastructure must be built to provide secure yet easily accessible and standardized data exchange that also facilitates statistical analyses of the stored data. Data collection sites want full control over the accumulation of their data, while the integration should allow effortless slicing and dicing of the data for different types of data pooling and study designs. Here we describe how we constructed a federated database infrastructure for genotype and phenotype information collected in seven European countries and Australia, and how we connected these databases via a network called TwinNET to guarantee effortless data exchange and pooled analyses. This federated database system offers a powerful facility for combining different types of information from multiple data sources. The system is transparent to end users and application developers, because it makes the set of federated data sources look like a single system. The user need not know the format in which the data are stored or the site where they reside, the language or programming interface of the data source, how the data are physically stored, whether they are partitioned and/or replicated, or what networking protocols are used. The user sees a single standardized interface with the desired data elements for pooled analyses.
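As a rough illustration of the transparency described in the abstract, the following minimal Python sketch shows how a federated layer might present sites with different local schemas through one standardized interface. It is not the TwinNET implementation. The site names, table and column names, the example values, and the `pooled_heights` function are all hypothetical, and in-memory SQLite databases stand in for the remote registries.

```python
import sqlite3

# Hypothetical site A: stores height in centimetres under its own column names.
def make_site_a():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE twins (twin_id TEXT, height_cm REAL)")
    db.executemany("INSERT INTO twins VALUES (?, ?)",
                   [("A1", 170.0), ("A2", 182.5)])
    return db

# Hypothetical site B: different table, different identifiers, height in metres.
def make_site_b():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE subjects (pid TEXT, height_m REAL)")
    db.executemany("INSERT INTO subjects VALUES (?, ?)",
                   [("B1", 1.6), ("B2", 1.75)])
    return db

# Each site keeps control of its own data. The federation layer only holds
# a mapping from the local schema to the harmonized columns
# (subject_id, height_cm). These mappings are illustrative assumptions.
SITES = {
    "site_a": (make_site_a(), "SELECT twin_id, height_cm FROM twins"),
    "site_b": (make_site_b(), "SELECT pid, height_m * 100.0 FROM subjects"),
}

def pooled_heights(sites=SITES):
    """Return harmonized rows from every site through one interface."""
    rows = []
    for name, (db, local_sql) in sites.items():
        for subject_id, height_cm in db.execute(local_sql):
            rows.append({
                "site": name,
                "subject_id": subject_id,
                # Round to hide floating-point noise from unit conversion.
                "height_cm": round(height_cm, 1),
            })
    return rows

if __name__ == "__main__":
    for row in pooled_heights():
        print(row)
```

A caller of `pooled_heights` sees only the harmonized fields and does not need to know which site stores which table or unit, which is the property the abstract calls transparency. A real federation would additionally have to handle networking, access control, and confidentiality, none of which this sketch attempts.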