Centro Nacional de Investigaciones Oncológicas, Biología Computacional y Estructural, Madrid, Spain.
Mol Inform. 2011 Jun;30(6-7):506-19. doi: 10.1002/minf.201100005. Epub 2011 Jul 12.
Providing prior knowledge about the biological properties of chemicals, such as kinetic values, protein targets, or toxic effects, can facilitate many aspects of drug development. Chemical information is rapidly accumulating in free-text documents of all kinds, such as patents, industry reports, and scientific articles, which has motivated the development of specifically tailored text mining applications. Despite the potential gains, chemical text mining still faces significant challenges. One of the most salient is the recognition of chemical entities mentioned in text. To help practitioners contribute to this area, a good portion of this review is devoted to this issue and presents the basic concepts and principles underlying the main strategies. Technical details are introduced together with relevant bibliographic references. Other tasks discussed include retrieving relevant articles, identifying relationships between chemicals and other entities, and determining the structures of chemicals mentioned in text. This review also introduces a number of published applications that can be used to build pipelines for topics such as drug side effects, toxicity, and protein-disease-compound network analysis. We conclude with an outlook on how we expect the field to evolve, discussing its possibilities and current limitations.