Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA.
Harvard Medical School, Boston, MA.
AMIA Annu Symp Proc. 2021 Jan 25;2020:658-667. eCollection 2020.
Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collecting the positive examples needed to train a model may demand an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When evaluated on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and an F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.
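The core idea of learning a binary classifier from biased positive-unlabeled data can be illustrated with a minimal sketch. This is not the paper's pipeline; it is a toy implementation of the classic Elkan-Noto PU calibration on synthetic 1-D data, where a "non-traditional" classifier estimating P(labeled | x) is rescaled by the estimated label frequency c = P(labeled | positive) to recover P(positive | x). All data, thresholds, and the simple gradient-descent logistic regression are illustrative assumptions.

```python
# Toy sketch of positive-unlabeled (PU) learning via Elkan-Noto calibration.
# NOT the paper's method: synthetic data, hand-rolled logistic regression.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rare positives cluster at +2, negatives at -2.
n = 2000
y = rng.random(n) < 0.1                      # true (hidden) labels
x = np.where(y, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))

# PU setting: only a fraction c of true positives carry a label (s = 1);
# everything else is "unlabeled" and treated as negative during training.
c_true = 0.5
s = y & (rng.random(n) < c_true)

def train_logreg(x, t, lr=0.1, steps=2000):
    """Fit P(t=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        g = p - t                             # gradient of log loss
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

# Step 1: "non-traditional" classifier g(x) ~ P(s=1 | x), labeled vs. unlabeled.
w, b = train_logreg(x, s.astype(float))
g = 1.0 / (1.0 + np.exp(-(w * x + b)))

# Step 2: estimate label frequency c = P(s=1 | y=1) as the mean of g
# over the labeled positives (Elkan & Noto, 2008).
c_hat = g[s].mean()

# Step 3: rescale to recover P(y=1 | x) = P(s=1 | x) / c.
p_y = np.clip(g / c_hat, 0.0, 1.0)
```

In a realistic rare-entity setting the positive prevalence would be far lower and `c_hat` would ideally be estimated on a held-out set of labeled positives rather than in-sample; the sketch only shows the calibration mechanics.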