Department of Medicine, 300 Pasteur Drive, Room S101, Mail Code 5110, Stanford University, Stanford, CA 94305, USA.
J Biomed Inform. 2010 Dec;43(6):1009-19. doi: 10.1016/j.jbi.2010.08.005. Epub 2010 Aug 17.
Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.
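The extraction step described in the abstract (entity lexicon, syntactic parse, mapping to a common schema) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes the parser output has already been reduced to (subject phrase, verb, object phrase) triples, and the names `LEXICON`, `SCHEMA`, `entity_type`, and `extract` are hypothetical, as are the toy entries and the verb-to-relation mapping.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon of key PGx entities: surface form -> entity type.
LEXICON = {
    "VKORC1": "Gene",
    "warfarin": "Drug",
    "clotting disorder": "Phenotype",
}

# Hypothetical common schema: surface verb -> normalized relation type.
SCHEMA = {
    "affects": "affects",
    "influences": "affects",
    "treats": "treats",
}


@dataclass(frozen=True)
class Relation:
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str


def entity_type(phrase):
    """Type a phrase as a key entity or as an entity modified by one.

    "warfarin" -> "Drug"; "warfarin response" -> "DrugResponse".
    Returns None when the phrase does not start with a lexicon term.
    """
    # Try longer lexicon terms first so multi-word entities win.
    for term in sorted(LEXICON, key=len, reverse=True):
        if phrase == term:
            return LEXICON[term]
        if phrase.startswith(term + " "):
            head = phrase[len(term) + 1:]
            return LEXICON[term] + head.title().replace(" ", "")
    return None


def extract(triples):
    """Keep triples whose arguments are typed and whose verb is in the schema."""
    relations = []
    for subj, verb, obj in triples:
        s_type, o_type = entity_type(subj), entity_type(obj)
        rel = SCHEMA.get(verb)
        if s_type and o_type and rel:
            relations.append(Relation(subj, s_type, rel, obj, o_type))
    return relations


if __name__ == "__main__":
    # Triples assumed to come from an upstream dependency parser.
    parsed = [
        ("VKORC1 polymorphism", "influences", "warfarin response"),
        ("warfarin", "treats", "clotting disorder"),
        ("patients", "received", "warfarin"),
    ]
    for r in extract(parsed):
        print(r)
```

In this sketch the third triple is dropped because "patients" is not in the lexicon and "received" is not in the schema. A real system would need lemmatization, synonym handling, and far larger lexicons than the three entries shown here.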