Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; Division of Health and Biomedical Informatics, Northwestern University, Chicago, IL, USA.
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA.
J Biomed Inform. 2016 Dec;64:265-272. doi: 10.1016/j.jbi.2016.10.014. Epub 2016 Oct 27.
OBJECTIVES: Extracting data from publication reports is a standard process in systematic review (SR) development. However, data extraction still relies heavily on manual effort, which is slow, costly, and subject to human error. In this study, we developed a text summarization system aimed at enhancing productivity and reducing errors in the traditional data extraction process.
METHODS: We developed a computer system that used machine learning and natural language processing approaches to automatically generate summaries of full-text scientific publications. The summaries were evaluated at the sentence and fragment levels for how well they captured common clinical SR data elements such as sample size, group size, and PICO values. We compared the computer-generated summaries with human-written summaries (title and abstract) in terms of the presence of the information needed for data extraction, as presented in the Cochrane review's study characteristics tables.
RESULTS: At the sentence level, the computer-generated summaries covered more of the information needed for systematic reviews than the human-written summaries did (recall 91.2% vs. 83.8%, p<0.001). They also had a higher density of relevant sentences (precision 59% vs. 39%, p<0.001). At the fragment level, the ensemble approach combining rule-based, concept mapping, and dictionary-based methods performed better than any individual method alone, achieving an 84.7% F-measure.
CONCLUSION: Computer-generated summaries are a potential alternative information source for data extraction in systematic review development. Machine learning and natural language processing are promising approaches to the development of such an extractive summarization system.
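The recall, precision, and F-measure figures reported in RESULTS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sentence ids, and the choice to combine extractors by taking the union of their outputs are all assumptions made for illustration, since the abstract does not say how the ensemble merges its component methods.

```python
def precision_recall_f1(selected, relevant):
    """Return (precision, recall, F1) for selected items against a gold set."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    # Precision: share of selected items that are actually relevant.
    precision = true_pos / len(selected) if selected else 0.0
    # Recall: share of relevant items that were selected.
    recall = true_pos / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def union_ensemble(*extractor_outputs):
    """Hypothetical ensemble: keep anything that at least one extractor selected."""
    combined = set()
    for output in extractor_outputs:
        combined |= set(output)
    return combined


# Toy data (invented): ids of sentences that carry SR data elements.
relevant = {1, 4, 7, 9}
rule_based = {1, 2, 4}        # hypothetical rule-based extractor output
dictionary_based = {4, 7, 8}  # hypothetical dictionary-based extractor output

single = precision_recall_f1(rule_based, relevant)
ensemble = precision_recall_f1(
    union_ensemble(rule_based, dictionary_based), relevant
)
print("rule-based only:", single)
print("union ensemble: ", ensemble)
```

In this toy case the union raises recall at some cost in precision, which is one way an ensemble could reach a higher F-measure than its components; the actual combination strategy in the study may differ.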