Thompson Paul, Daikou Sophia, Ueno Kenju, Batista-Navarro Riza, Tsujii Jun'ichi, Ananiadou Sophia
National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.
Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
J Cheminform. 2018 Aug 13;10(1):37. doi: 10.1186/s13321-018-0290-y.
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts, who follow annotation guidelines to ensure consistency. We present PHAEDRA, a semantically annotated corpus of 597 MEDLINE abstracts that encodes rich information on drug effects and their interactions. Its quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV in terms of the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically.
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/
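The inter-annotator agreement figures quoted above are F-Scores computed under relaxed matching criteria. The abstract does not define those criteria, so the following is only a minimal sketch under one common assumption: two entity annotations match when they carry the same label and their character spans overlap. The `Span` type, the function names and the example labels (`Drug`, `Disorder`) are hypothetical and are not taken from the corpus or its guidelines.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # entity type (hypothetical label set)


def overlaps(a: Span, b: Span) -> bool:
    """Assumed relaxed criterion: same label and at least one shared character."""
    return a.label == b.label and a.start < b.end and b.start < a.end


def relaxed_f_score(reference: List[Span], response: List[Span]) -> float:
    """F-Score of one annotator (response) against another (reference)."""
    if not reference or not response:
        return 0.0
    # Precision: fraction of response spans that overlap some reference span.
    matched_response = sum(any(overlaps(r, g) for g in reference) for r in response)
    # Recall: fraction of reference spans that overlap some response span.
    matched_reference = sum(any(overlaps(g, r) for r in response) for g in reference)
    precision = matched_response / len(response)
    recall = matched_reference / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy annotations from two annotators over the same abstract.
annotator_a = [Span(0, 9, "Drug"), Span(20, 35, "Disorder"), Span(50, 58, "Drug")]
annotator_b = [Span(0, 9, "Drug"), Span(22, 35, "Disorder"), Span(70, 80, "Disorder")]

print(f"relaxed F-Score: {relaxed_f_score(annotator_a, annotator_b):.3f}")
```

In this toy case two of the three spans from each annotator find an overlapping partner, so precision and recall are both 2/3. Under exact matching the second pair would not count, because the start offsets differ, which is why relaxed scores are typically higher than exact ones.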