Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, Ohio, USA.
J Am Med Inform Assoc. 2014 Jan-Feb;21(1):90-6. doi: 10.1136/amiajnl-2012-001584. Epub 2013 May 18.
A comprehensive and machine-understandable cancer drug-side effect (drug-SE) relationship knowledge base is important for in silico cancer drug target discovery, drug repurposing, and toxicity prediction, and for personalized risk-benefit decisions by cancer patients. While US Food and Drug Administration (FDA) drug labels capture well-known cancer drug SE information, much cancer drug SE knowledge remains buried in the published biomedical literature. We present a relationship extraction approach to extract cancer drug-SE pairs from the literature.
We used 21,354,075 MEDLINE records as the text corpus. We extracted drug-SE co-occurrence pairs using a cancer drug lexicon and a clean SE lexicon that we created. We then developed two filtering approaches to remove drug-disease treatment pairs and subsequently a ranking scheme to further prioritize filtered pairs. Finally, we analyzed relationships among SEs, gene targets, and indications.
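The co-occurrence step can be illustrated with a minimal sketch. It assumes sentence-level matching against two small term sets; the abstract does not state the matching granularity, and the lexicon entries, example sentences, and function names below are hypothetical placeholders rather than the authors' resources or implementation.

```python
import re
from collections import Counter
from itertools import product


def find_terms(sentence, lexicon):
    """Return the lexicon terms that occur as whole words in the sentence."""
    text = sentence.lower()
    return {t for t in lexicon if re.search(r"\b" + re.escape(t) + r"\b", text)}


def cooccurrence_pairs(sentences, drug_lexicon, se_lexicon):
    """Count (drug, SE) pairs that appear together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        drugs = find_terms(sentence, drug_lexicon)
        side_effects = find_terms(sentence, se_lexicon)
        for pair in product(drugs, side_effects):
            counts[pair] += 1
    return counts


# Hypothetical toy lexicons and sentences, for illustration only.
drug_lexicon = {"cisplatin", "doxorubicin"}
se_lexicon = {"nephrotoxicity", "cardiotoxicity", "nausea"}
sentences = [
    "Cisplatin-induced nephrotoxicity limits dosing.",
    "Doxorubicin is associated with cardiotoxicity.",
    "Nausea was common after cisplatin.",
    "Cisplatin nephrotoxicity was dose dependent.",
]

pairs = cooccurrence_pairs(sentences, drug_lexicon, se_lexicon)
for (drug, se), n in pairs.most_common():
    print(drug, se, n)
```

Raw co-occurrence of this kind also picks up drug-disease treatment pairs, which is why the filtering and ranking steps described above are needed before the pairs are usable.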
We extracted 56,602 cancer drug-SE pairs. The filtering algorithms improved the precision of extracted pairs from 0.252 at baseline to 0.426, representing a 69% improvement in precision with no decrease in recall. The ranking algorithm further prioritized filtered pairs and achieved a precision of 0.778 for top-ranked pairs. We showed that cancer drugs that share SEs tend to have overlapping gene targets and overlapping indications.
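The 69% figure is a relative gain over the baseline precision, not a difference in percentage points. A short check using only the two precision values reported above (the variable names are illustrative):

```python
# Precision values as reported in the abstract.
baseline_precision = 0.252
filtered_precision = 0.426

# Relative improvement: the gain expressed as a fraction of the baseline.
relative_gain = (filtered_precision - baseline_precision) / baseline_precision

# Absolute improvement, for comparison.
absolute_gain = filtered_precision - baseline_precision

print(f"relative gain: {relative_gain:.1%}")
print(f"absolute gain: {absolute_gain:.3f}")
```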
The relationship extraction approach is effective in extracting many cancer drug-SE pairs from the literature. This unique knowledge base, when combined with existing cancer drug SE knowledge, can facilitate drug target discovery, drug repurposing, and toxicity prediction.