Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
Brief Bioinform. 2012 Jul;13(4):460-94. doi: 10.1093/bib/bbs018.
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.