Cormack James, Nath Chinmoy, Milward David, Raja Kalpana, Jonnalagadda Siddhartha R
Linguamatics Ltd., 324 Cambridge Science Park, Milton Road, Cambridge CB4 0WG, UK.
Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 750 N. Lake Shore Drive, 11th Floor, Chicago, IL 60611, USA.
J Biomed Inform. 2015 Dec;58 Suppl(0):S120-S127. doi: 10.1016/j.jbi.2015.06.030. Epub 2015 Jul 22.
This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors from patient records, as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven, rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-score of 91.7%, one percent behind the top-performing system.
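As an illustration of the reported metric, the following minimal sketch computes a micro-averaged precision, recall and F-score over document-level labels. It assumes each document carries a set of (risk factor, attribute) pairs; that label shape, the function name `micro_prf` and the toy data are hypothetical, and this is not the official challenge evaluation script.

```python
from typing import Dict, Set, Tuple

# Hypothetical label shape: (risk factor, attribute), e.g. ("diabetes", "mention").
Label = Tuple[str, str]


def micro_prf(
    gold: Dict[str, Set[Label]],
    pred: Dict[str, Set[Label]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over document-level label sets."""
    tp = fp = fn = 0
    # Visit every document that appears in either the gold or the predicted set.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)  # labels found in both
        fp += len(p - g)  # predicted but not in gold
        fn += len(g - p)  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration.
gold = {
    "doc1": {("diabetes", "mention"), ("hypertension", "mention")},
    "doc2": {("smoker", "current")},
}
pred = {
    "doc1": {("diabetes", "mention")},
    "doc2": {("smoker", "current"), ("obese", "mention")},
}
precision, recall, f1 = micro_prf(gold, pred)
```

Counts are pooled across documents before the ratios are taken, so frequent labels dominate the score. This is one reason class imbalance in the training data can matter for the headline figure.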