The School of Information, Communication and Computer Technologies, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani 12120, Thailand.
The School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi 923-1292, Japan.
J Healthc Eng. 2017;2017:7575280. doi: 10.1155/2017/7575280. Epub 2017 Sep 26.
Information extraction and knowledge discovery regarding adverse drug reactions (ADRs) from large-scale clinical texts are useful and necessary processes. Two major difficulties of this task are the lack of domain experts for labeling examples and the intractability of processing unstructured clinical texts. Most previous works have addressed these issues by applying semisupervised learning for the former and a word-based approach for the latter, but they face complexity in acquiring initial labeled data and ignore the structured sequence of natural language. In this study, we propose automatic data labeling by distant supervision, in which knowledge bases are exploited to assign a relation label to each drug-event pair in the texts, and we then use patterns to characterize the ADR relation. Multiple-instance learning with the expectation-maximization method is employed to estimate model parameters. The method applies transductive learning to iteratively reassign the probability of each unknown drug-event pair at training time. In experiments on 50,998 discharge summaries, we evaluate our method by varying a large number of parameters, that is, pattern types, pattern-weighting models, and the initial and iterative weightings of relations for unlabeled data. Based on these evaluations, our proposed method outperforms the word-based feature for NB-EM (iEM), MILR, and TSVM, improving the F1 score by 11.3%, 9.3%, and 6.5%, respectively.
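The distant-supervision labeling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge base contents, the drug and event lists, the function name `label_pairs`, and the naive substring matching are all assumptions made for illustration. Pairs found in the knowledge base receive the label "ADR"; all other co-occurring pairs are left as "unknown", which is the set whose probabilities the abstract says are reassigned iteratively during training.

```python
from itertools import product

# Hypothetical knowledge base of known (drug, adverse event) pairs.
# Illustrative entries only, not taken from the paper.
KNOWN_ADR = {("warfarin", "bleeding"), ("metformin", "lactic acidosis")}


def label_pairs(sentence, drugs, events, known_adr=KNOWN_ADR):
    """Label every drug-event pair that co-occurs in one sentence.

    Matching is naive lowercase substring search (an assumption);
    a real system would use proper entity recognition.
    """
    text = sentence.lower()
    found_drugs = [d for d in drugs if d in text]
    found_events = [e for e in events if e in text]
    labeled = []
    for drug, event in product(found_drugs, found_events):
        # Pairs in the knowledge base are labeled; the rest stay unknown.
        label = "ADR" if (drug, event) in known_adr else "unknown"
        labeled.append((drug, event, label))
    return labeled


drugs = ["warfarin", "metformin"]
events = ["bleeding", "nausea", "lactic acidosis"]
result = label_pairs(
    "Patient on warfarin developed bleeding and nausea.", drugs, events
)
print(result)
```

Here the pair (warfarin, bleeding) is labeled from the hypothetical knowledge base, while (warfarin, nausea) remains unknown and would be a candidate for iterative reweighting.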