Yang Jin Ok, Hwang Sohyun, Oh Jeongsu, Bhak Jong, Sohn Tae-Kwon
Korean BioInformation Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, 305-806, Korea.
BMC Bioinformatics. 2008 Dec 12;9 Suppl 12(Suppl 12):S19. doi: 10.1186/1471-2105-9-S12-S19.
Studying the relationship between disease and genetic variations such as single nucleotide polymorphisms (SNPs) is important, because such variations can cause disease by influencing key biological regulatory processes. Despite the need to analyze correlations between SNPs and disease, most existing databases either provide information only on functional variants at specific locations on the genome or cover only a few disease-associated genes. No combined resource broadly supports gene-, SNP-, and disease-related information and captures the relationships among these data. We therefore developed an integrated database-pipeline system for studying SNPs and diseases.
To implement the pipeline system for the integrated database, we first unified complicated and redundant disease terms and gene names. Disease terms were unified using the Unified Medical Language System (UMLS) for classification and noun modification; gene names were unified using the HUGO Gene Nomenclature Committee (HGNC) and NCBI gene databases. Next, we collected and integrated representative databases for three categories of information. For genes and proteins, we examined the NCBI mRNA, UniProt, UCSC Table Track, and MitoDat databases. For genetic variants, we used the dbSNP, JSNP, ALFRED, and HGVbase databases. For diseases, we used the OMIM, GAD, and HGMD databases. The database-pipeline system provides a disease thesaurus that includes the genes and SNPs associated with each disease. Search results for these categories are available at http://diseasome.kobic.re.kr/. A genome browser is also available to highlight findings and to allow convenient review of potentially deleterious SNPs in genes strongly associated with specific diseases and clinical phenotypes.
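The two steps described above (unifying names, then integrating records into a disease thesaurus) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the synonym tables are small hypothetical dictionaries standing in for UMLS and HGNC/NCBI lookups, and every disease term, gene symbol, and rs identifier is a made-up placeholder.

```python
from collections import defaultdict

# Hypothetical lookup tables (assumptions): toy stand-ins for UMLS
# disease-term mapping and HGNC/NCBI gene-symbol mapping.
DISEASE_SYNONYMS = {
    "disease x syndrome": "disease x",
    "x disease": "disease x",
}
GENE_ALIASES = {
    "G1_OLD": "GENE1",
}


def normalize_disease(term):
    """Lowercase, collapse whitespace, then map synonyms to one preferred term."""
    key = " ".join(term.lower().split())
    return DISEASE_SYNONYMS.get(key, key)


def normalize_gene(symbol):
    """Uppercase and map alias symbols to one preferred symbol."""
    key = symbol.strip().upper()
    return GENE_ALIASES.get(key, key)


def build_thesaurus(disease_gene_records, snp_gene_records):
    """Merge (disease, gene) and (snp, gene) pairs into
    {disease: {"genes": set, "snps": set}} keyed by normalized names."""
    snps_by_gene = defaultdict(set)
    for snp, gene in snp_gene_records:
        snps_by_gene[normalize_gene(gene)].add(snp)

    thesaurus = defaultdict(lambda: {"genes": set(), "snps": set()})
    for disease, gene in disease_gene_records:
        entry = thesaurus[normalize_disease(disease)]
        symbol = normalize_gene(gene)
        entry["genes"].add(symbol)
        # Attach every SNP recorded for this gene to the disease entry.
        entry["snps"] |= snps_by_gene.get(symbol, set())
    return dict(thesaurus)


# Illustrative records only; none of these are real database entries.
disease_gene = [("Disease  X", "g1_old"), ("disease x syndrome", "GENE1")]
snp_gene = [("rs0000001", "GENE1"), ("rs0000002", "G1_OLD"), ("rs0000003", "GENE2")]
thesaurus = build_thesaurus(disease_gene, snp_gene)
```

Normalizing both sides before the join is what lets records that use different spellings of the same disease or gene collapse into a single thesaurus entry.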
Our system is designed to capture the relationships between disease-associated SNPs and disease-causing genes. The integrated database-pipeline provides a list of candidate genes and SNP markers for evaluation in both epidemiological and molecular biological approaches to disease-gene association studies. Researchers can then select the data set for an association study semi-automatically while taking into account the relationships between genetic variation and disease. The database can also make disease-association studies more economical and facilitate an understanding of the processes that cause disease. Currently, the database contains 14,674 SNP records and 109,715 gene records associated with human diseases, and it is updated at regular intervals.
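As a rough illustration of how a candidate list for an association study might be drawn semi-automatically from such integrated records, the sketch below keeps genes that are linked to a disease by a minimum number of distinct source databases. That threshold rule, the function name `candidate_markers`, and all records are assumptions made for this example; the source does not specify how association strength is scored.

```python
from collections import defaultdict


def candidate_markers(associations, snps_by_gene, disease, min_sources=2):
    """Return [(gene, sorted SNP ids)] for genes linked to `disease` by at
    least `min_sources` distinct source databases (an assumed criterion)."""
    sources = defaultdict(set)
    for rec_disease, gene, source in associations:
        if rec_disease == disease:
            sources[gene].add(source)  # a set ignores duplicate reports
    selected = sorted(g for g, s in sources.items() if len(s) >= min_sources)
    return [(g, sorted(snps_by_gene.get(g, ()))) for g in selected]


# Illustrative (disease, gene, source) records; not real database content.
associations = [
    ("disease x", "GENE1", "OMIM"),
    ("disease x", "GENE1", "GAD"),
    ("disease x", "GENE1", "GAD"),
    ("disease x", "GENE2", "GAD"),
    ("disease y", "GENE2", "OMIM"),
]
snps_by_gene = {"GENE1": {"rs0000002", "rs0000001"}}

strict = candidate_markers(associations, snps_by_gene, "disease x")
relaxed = candidate_markers(associations, snps_by_gene, "disease x", min_sources=1)
```

Lowering `min_sources` widens the candidate set at the cost of including genes supported by a single source, which is the kind of trade-off a researcher would tune when fixing the data set.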