Singhal Ayush, Leaman Robert, Catlett Natalie, Lemberger Thomas, McEntyre Johanna, Polson Shawn, Xenarios Ioannis, Arighi Cecilia, Lu Zhiyong
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Selventa, Cambridge, MA 02140, USA.
Database (Oxford). 2016 Dec 26;2016. doi: 10.1093/database/baw161. Print 2016.
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large-scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions, including (i) the 'scalability' issue arising from the increasing need to mine information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue of applying trained systems to text genres not seen during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, to sustain the curation ecosystem and have text-mining systems adopted for practical benefit, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.