Rusanov Alexander, Miotto Riccardo, Weng Chunhua
Department of Anesthesiology, Columbia University, New York, New York, USA.
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA.
JAMIA Open. 2018 Oct;1(2):283-293. doi: 10.1093/jamiaopen/ooy009. Epub 2018 Sep 4.
Traditionally, research themes and trends within a discipline were summarized by manually reviewing the scientific works in the field. In the age of "big data," however, the exponential growth of document repositories makes such traditional techniques increasingly difficult to apply, and new methods for discovering this information are needed. Our objectives are to develop a pipeline for unsupervised theme extraction and summarization of thematic trends in document repositories, and to test it by applying it to a specific domain.
To that end, we detail a pipeline that combines machine learning and natural language processing for unsupervised theme extraction, a novel method for summarizing thematic trends, and network mapping for visualizing thematic relations. We then apply this pipeline to a collection of anesthesiology abstracts.
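As a rough illustration of what summarizing thematic trends can involve, the sketch below averages per-document topic weights by publication year and fits a least-squares slope to each topic's yearly prevalence. It is not the method described in the paper: the input format (a year plus a list of topic weights per abstract), the function names, and the use of a linear slope as the trend summary are all assumptions made for this example.

```python
from collections import defaultdict


def yearly_topic_prevalence(docs):
    """Average topic weights per year.

    docs: iterable of (year, topic_weights) pairs, where topic_weights is a
    list of floats (hypothetical output of some topic model).
    Returns {year: [mean weight of each topic in that year]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, weights in docs:
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        for k, w in enumerate(weights):
            sums[year][k] += w
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def trend_slope(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        return 0.0  # a single year carries no trend information
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


def topic_trends(docs):
    """Return one slope per topic: positive means rising prevalence."""
    prevalence = yearly_topic_prevalence(docs)
    years = sorted(prevalence)
    n_topics = len(prevalence[years[0]])
    return [
        trend_slope(years, [prevalence[y][k] for y in years])
        for k in range(n_topics)
    ]


if __name__ == "__main__":
    # Toy data: two topics over three years (illustrative values only).
    toy_docs = [
        (2000, [0.8, 0.2]),
        (2000, [0.6, 0.4]),
        (2001, [0.5, 0.5]),
        (2002, [0.3, 0.7]),
    ]
    print(topic_trends(toy_docs))
```

In the toy data the first topic declines and the second rises by the same amount per year. A real corpus would likely need smoothing or a significance test before any slope is read as a trend.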
We demonstrate how this pipeline enables discovery of major themes and temporal trends in anesthesiology research and facilitates document classification and corpus exploration.
The relation of prevalent topics and extracted trends to recent events in both anesthesiology and healthcare in general demonstrates the pipeline's utility. Furthermore, the agreement between the unsupervised thematic grouping and human-assigned classification validates the pipeline's accuracy and demonstrates another potential use.
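The abstract does not say how agreement between the unsupervised grouping and the human-assigned classification was measured. One common and simple choice is cluster purity, sketched below as an assumption rather than as the authors' metric. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, human_labels):
    """Fraction of documents whose human label matches the majority
    human label of the cluster they were assigned to (1.0 = perfect)."""
    if len(cluster_ids) != len(human_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count human labels inside each unsupervised cluster.
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, human_labels):
        by_cluster[cluster][label] += 1
    # Each cluster contributes the size of its largest label group.
    matched = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return matched / len(cluster_ids)


if __name__ == "__main__":
    # Illustrative values only: six abstracts, three clusters.
    clusters = [0, 0, 0, 1, 1, 2]
    labels = ["pain", "pain", "airway", "airway", "airway", "cardiac"]
    print(purity(clusters, labels))  # 5 of 6 documents match their cluster majority
```

Purity rises trivially as the number of clusters grows, so it is usually reported alongside the number of clusters or a chance-corrected measure.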
The described pipeline enables summarization and exploration of large document repositories, facilitates classification, and aids in trend identification. A more robust and user-friendly interface will facilitate the expansion of this methodology to other domains; this will be the focus of our group's future work.