The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining.

Author information

Eckman B A, Aaronson J S, Borkowski J A, Bailey W J, Elliston K O, Williamson A R, Blevins R A

Affiliation

Department of Bioinformatics, Merck Research Laboratories, West Point, PA, USA.

Publication information

Bioinformatics. 1998;14(1):2-13. doi: 10.1093/bioinformatics/14.1.2.

PMID:9520496

Abstract

MOTIVATION

To make effective use of the vast amounts of expressed sequence tag (EST) sequence data generated by the Merck-sponsored EST project and other similar efforts, sequences must be organized into gene classes, and scientists must be able to 'mine' the gene class data in the context of related genomic data.

RESULTS

This paper presents the Merck Gene Index browser, an easily extensible, World Wide Web-based system for mining the Merck Gene Index (MGI) and related genomic data. The MGI is a non-redundant set of clones and sequences, each representing a distinct gene, constructed from all high-quality 3' EST sequences generated by the Merck-sponsored EST project. The MGI browser integrates data from a variety of sources and storage formats, both local and remote, using an eclectic integration strategy, including a federation of relational databases, a local data warehouse and simple hypertext links. Data currently integrated include: LENS cDNA clone and EST data, dbEST protein and non-EST nucleic acid similarity data, WashU sequence chromatograms, Entrez sequence and Medline entries, and UniGene gene clusters. Flatfile sequence data are accessed using the Bioapps server, an internally developed client-server system that supports generic sequence analysis applications. Browser data are retrieved and formatted by means of the Bioinformatics Data Integration Toolkit (B-DIT), a new suite of Perl routines.
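
The integration strategy described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used by B-DIT, and it does not reproduce the authors' code or schemas. Two in-memory SQLite databases stand in for members of a database federation, and a URL template stands in for a simple hypertext link to a remote source. The table names, column names, accession values, link URL and the `gene_class_report` function are all hypothetical.

```python
import sqlite3

# Hypothetical link template; not a real MGI, Entrez or UniGene URL.
LINK_TEMPLATE = "https://example.org/entrez?acc={accession}"


def make_db(schema, insert_sql, rows):
    """Create an in-memory database holding one illustrative table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert_sql, rows)
    return conn


# Two independent databases stand in for federation members.
# All schemas and values are invented for illustration.
clone_db = make_db(
    "CREATE TABLE est (gene_class TEXT, accession TEXT)",
    "INSERT INTO est VALUES (?, ?)",
    [("GC1", "ACC0001"), ("GC1", "ACC0002"), ("GC2", "ACC0003")],
)
similarity_db = make_db(
    "CREATE TABLE hit (accession TEXT, subject TEXT, score REAL)",
    "INSERT INTO hit VALUES (?, ?, ?)",
    [("ACC0001", "protein A", 120.0), ("ACC0003", "protein B", 85.5)],
)


def gene_class_report(gene_class):
    """Combine EST, similarity and link data for one gene class."""
    ests = clone_db.execute(
        "SELECT accession FROM est WHERE gene_class = ? ORDER BY accession",
        (gene_class,),
    ).fetchall()
    report = []
    for (accession,) in ests:
        # Cross-database lookup done in application code.
        hits = similarity_db.execute(
            "SELECT subject, score FROM hit WHERE accession = ?",
            (accession,),
        ).fetchall()
        report.append({
            "accession": accession,
            "hits": hits,
            # Remote data is referenced by a link, not copied locally.
            "link": LINK_TEMPLATE.format(accession=accession),
        })
    return report
```

Calling `gene_class_report("GC1")` returns one entry per EST in that gene class, each with its similarity hits and a link. Joining in application code keeps each source independent, which is one plausible reading of why such a system would be easy to extend with new sources.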
