Medical Informatics Division, Case Western Reserve University, OH, USA.
J Biomed Inform. 2012 Oct;45(5):827-34. doi: 10.1016/j.jbi.2012.04.011. Epub 2012 Apr 27.
An important task in pharmacogenomics (PGx) studies is to identify genetic variants that may affect drug response. The success of many systematic and integrative computational approaches to PGx depends on the availability of accurate, comprehensive, and machine-understandable drug-gene relationship knowledge bases. The scientific literature is one of the most comprehensive sources of PGx-specific drug-gene relationships. However, the major barrier to accessing this information is that the knowledge is buried in a large amount of free text that machines cannot readily interpret. There is therefore a need for automatic approaches that extract structured PGx-specific drug-gene relationships from unstructured free-text literature. In this study, we developed a conditional relationship extraction approach that extracts PGx-specific drug-gene pairs from 20 million MEDLINE abstracts, using known drug-gene pairs as prior knowledge. We demonstrate that the conditional approach significantly improves precision and F1 measure over the unconditioned approach, at the cost of lower recall (precision: 0.345 vs. 0.11; recall: 0.481 vs. 1.00; F1: 0.402 vs. 0.201). A co-occurrence-based method is used as the underlying relationship extraction method because of its simplicity. It can be replaced by, or combined with, more advanced methods such as machine learning or natural language processing approaches to further improve the performance of drug-gene relationship extraction from free text. Our method is not limited to drug-gene relationships; it can be generalized to extract other types of relationships when related background knowledge bases exist.
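To make the co-occurrence baseline and the reported metrics concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state how known drug-gene pairs condition the extraction, so the filter used here (keep a candidate only if its gene appears in some known pair) is a purely hypothetical stand-in. The sentences, lexicons, gold set, and all function names are illustrative assumptions as well. The last lines recompute F1 from the rounded precision and recall quoted above: the conditional figures reproduce 0.402, while the rounded unconditioned precision of 0.11 gives about 0.198 rather than the reported 0.201, presumably because the reported value was computed from unrounded precision.

```python
import re
from itertools import product


def cooccurring_pairs(sentences, drugs, genes):
    """Return every (drug, gene) pair mentioned together in at least one sentence."""
    pairs = set()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        found_drugs = {d for d in drugs if d.lower() in tokens}
        found_genes = {g for g in genes if g.lower() in tokens}
        pairs.update(product(found_drugs, found_genes))
    return pairs


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(predicted, gold):
    """Return (precision, recall, F1) of a predicted pair set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy inputs, invented for illustration only (not data from the study).
sentences = [
    "Warfarin dose is influenced by CYP2C9 and VKORC1 variants.",
    "Clopidogrel response depends on CYP2C19 genotype.",
    "Warfarin was compared with TP53 expression in an unrelated assay.",
]
drugs = {"warfarin", "clopidogrel"}
genes = {"CYP2C9", "VKORC1", "CYP2C19", "TP53"}
gold = {("warfarin", "CYP2C9"), ("warfarin", "VKORC1"), ("clopidogrel", "CYP2C19")}

# Prior knowledge: drug-gene pairs assumed to be known in advance.
known_pairs = {("phenytoin", "CYP2C9"), ("acenocoumarol", "VKORC1"),
               ("omeprazole", "CYP2C19")}
known_genes = {gene for _, gene in known_pairs}

# Unconditioned baseline: accept every co-occurring pair.
unconditioned = cooccurring_pairs(sentences, drugs, genes)

# Hypothetical conditioning rule (an assumption, not the paper's rule):
# keep only candidates whose gene already appears in a known pair.
conditioned = {(d, g) for d, g in unconditioned if g in known_genes}

print("unconditioned:", evaluate(unconditioned, gold))
print("conditioned:  ", evaluate(conditioned, gold))

# Recompute F1 from the rounded precision/recall quoted in the abstract.
print("conditional F1:  ", round(f1_score(0.345, 0.481), 3))
print("unconditioned F1:", round(f1_score(0.11, 1.00), 3))
```

On the toy data the filter removes the spurious warfarin-TP53 pair without losing any gold pair. In the study itself, conditioning lowered recall from 1.00 to 0.481, so a real conditioning rule trades recall for precision in a way this toy example does not show.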