Text mining of cancer-related information: review of current status and future directions.

Author information

Spasić Irena, Livsey Jacqueline, Keane John A, Nenadić Goran

Affiliations

School of Computer Science & Informatics, Cardiff University, Cardiff CF24 3AA, UK.

Clinical Outcomes Unit, The Christie NHS Foundation Trust, Manchester M20 4BX, UK.

Publication information

Int J Med Inform. 2014 Sep;83(9):605-23. doi: 10.1016/j.ijmedinf.2014.06.009. Epub 2014 Jun 24.

Abstract

PURPOSE

This paper reviews the research literature on text mining (TM) with the aim of finding out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and to suggest future directions in TM development to support cancer research.

METHODS

A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar.

RESULTS

A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage, such as non-standard abbreviations and a high rate of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning, following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
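As an illustration only, the two ideas named above (dictionary-lookup NER and the F-measure used to score it) can be sketched in a few lines of Python. Nothing here comes from the reviewed systems: the lexicon `CANCER_TERMS`, the labels, the helper names and the sample report sentence are all hypothetical, and the F-measure shown is the balanced F1 over exact-match spans, which is one common choice rather than necessarily the one used in each study.

```python
import re

# Hypothetical toy lexicon; a real system would draw on a curated terminology.
CANCER_TERMS = {
    "adenocarcinoma": "DIAGNOSIS",
    "ductal carcinoma": "DIAGNOSIS",
    "lymph node": "ANATOMY",
}


def dictionary_ner(text, lexicon=CANCER_TERMS):
    """Return sorted (start, end, label) spans for lexicon terms found in text."""
    spans = []
    for term, label in lexicon.items():
        # Word boundaries prevent matches inside longer words; case is ignored.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def f_measure(predicted, gold):
    """Balanced F-measure (F1) over exact-match spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def gold_span(text, phrase, label):
    """Build a gold-standard span by locating the phrase in the text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Made-up report sentence with a deliberate misspelling ("adenocarcinma").
report = "Invasive ductal carcinoma; one lymph node positive for adenocarcinma."
gold = [
    gold_span(report, "ductal carcinoma", "DIAGNOSIS"),
    gold_span(report, "lymph node", "ANATOMY"),
    gold_span(report, "adenocarcinma", "DIAGNOSIS"),
]
predicted = dictionary_ner(report)
score = f_measure(predicted, gold)
```

In this toy case the lookup finds two of the three gold entities and misses the misspelled one, so precision is 1.0, recall is 2/3 and the F-measure is 0.8. This is the kind of loss from spelling errors that exact dictionary matching cannot recover.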
