Shah Nigam H, Jonquet Clement, Chiang Annie P, Butte Atul J, Chen Rong, Musen Mark A
Centre for Biomedical Informatics, School of Medicine, Stanford University, Stanford, CA 94305, USA.
BMC Bioinformatics. 2009 Feb 5;10 Suppl 2(Suppl 2):S1. doi: 10.1186/1471-2105-10-S2-S1.
The volume of publicly available genomic-scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. Because these annotations are not mapped to concepts in any ontology, it is difficult to integrate datasets across repositories. We have previously developed methods to map text annotations of tissue microarrays to concepts in the NCI Thesaurus and SNOMED-CT. In this work, we generalize those methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate their utility by processing annotations of datasets in the Gene Expression Omnibus. The resulting mappings enable ontology-based querying and integration of tissue and gene expression microarray data, including identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work, we have built a prototype system for ontology-based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements, such as gene expression datasets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts, to annotate and index them with concepts from appropriate ontologies. The key function of this system is to let users locate biomedical data resources related to particular ontology concepts.
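As a rough illustration of the annotate-and-index idea described above, the following Python sketch matches free-text metadata against a tiny hand-made term dictionary and builds an inverted index from concept to resource. It is not the authors' system. The concept identifiers, terms, resource IDs, and function names are all hypothetical placeholders, and the identifiers are not real UMLS concept identifiers. A real pipeline would presumably rely on a full ontology-derived lexicon and a dedicated concept recognizer rather than simple phrase matching.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: term -> concept ID (placeholders, not real UMLS CUIs).
LEXICON = {
    "breast cancer": "CONCEPT:0001",
    "breast carcinoma": "CONCEPT:0001",  # synonym mapped to the same concept
    "melanoma": "CONCEPT:0002",
}


def annotate(text, lexicon=LEXICON):
    """Return the set of concept IDs whose terms occur in the text as whole words."""
    # Lowercase and collapse whitespace so matching ignores case and spacing.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = set()
    for term, concept in lexicon.items():
        if re.search(r"\b" + re.escape(term) + r"\b", normalized):
            found.add(concept)
    return found


def build_index(resources, lexicon=LEXICON):
    """Map each concept ID to the set of resource IDs annotated with it."""
    index = defaultdict(set)
    for resource_id, text in resources.items():
        for concept in annotate(text, lexicon):
            index[concept].add(resource_id)
    return index


# Hypothetical resources from two different repositories.
RESOURCES = {
    "expr-1": "Expression profiling of breast carcinoma samples",
    "tma-7": "Tissue microarray: invasive Breast  Cancer, grade 2",
    "expr-2": "Time course in cutaneous melanoma cell lines",
}

index = build_index(RESOURCES)
print(sorted(index["CONCEPT:0001"]))  # ['expr-1', 'tma-7']
```

Because both synonyms resolve to the same placeholder concept, a single concept lookup returns the gene expression record and the tissue microarray record together, which is the cross-repository retrieval the abstract describes.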