Amigo Jorge, Phillips Christopher, Salas Antonio, Carracedo Angel
Spanish National Genotyping Center (CeGen), Genomic Medicine Group, CIBERER, University of Santiago de Compostela, Galicia, Spain.
BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S5. doi: 10.1186/1471-2105-10-S3-S5.
Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mining, these alone cannot cover the broad spectrum of needs of most medical and population genetics studies.
To address this limitation, we have built in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes, we have built a set of data processing scripts that take raw data from the major SNP variation databases (e.g. HapMap, Perlegen), split them into single genotypes, group these into populations, and merge them with complementary descriptive information extracted from dbSNP. This allows not only in-house standardization and normalization of the genotyping data retrieved from different repositories, but also the calculation of statistical indices, from simple allele frequency estimates to more elaborate genetic differentiation tests within populations, together with the ability to combine population samples from different databases.
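The split, group, and count steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input layout (one record per SNP with a mapping from sample ID to a two-letter genotype, and "N" marking a missing call), the function name `allele_frequencies`, and the sample and population labels are all assumptions made for illustration, and real HapMap or Perlegen dumps would need their own parsers.

```python
from collections import Counter, defaultdict


def allele_frequencies(rows, sample_population):
    """Estimate per-population allele frequencies.

    rows: iterable of (snp_id, {sample_id: genotype}), where genotype is a
          two-letter string such as "AG" (hypothetical simplified layout).
    sample_population: mapping from sample_id to a population label.
    """
    # (snp_id, population) -> counts of each observed allele
    counts = defaultdict(Counter)
    for snp_id, genotypes in rows:
        for sample_id, genotype in genotypes.items():
            population = sample_population.get(sample_id)
            # Skip samples with no population label and missing calls.
            if population is None or "N" in genotype:
                continue
            # Each character of the genotype string is one allele.
            counts[(snp_id, population)].update(genotype)

    frequencies = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        frequencies[key] = {allele: n / total for allele, n in counter.items()}
    return frequencies


# Hypothetical data: one SNP typed in four samples from two populations.
rows = [("rs0001", {"s1": "AA", "s2": "AG", "s3": "GG", "s4": "NN"})]
sample_population = {"s1": "POP1", "s2": "POP1", "s3": "POP2", "s4": "POP2"}
freqs = allele_frequencies(rows, sample_population)
```

Because the population label comes from a separate mapping, samples drawn from different source databases could be pooled under one label simply by assigning them the same value in `sample_population`, provided their genotypes have first been normalized to the same strand and allele coding.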
The present study demonstrates the viability of implementing scripts that handle extensive datasets of SNP genotypes at low computational cost, while dealing with certain complex issues that arise from the divergent nature and configuration of the most popular SNP repositories. The information contained in these databases can also be enriched with additional information obtained from other complementary databases in order to build a dedicated data mart. Updating the data structure is straightforward, and the design permits easy incorporation of new external data and the computation of supplementary statistical indices of interest.