Text mining of cancer-related information: review of current status and future directions.

Authors

Spasić Irena, Livsey Jacqueline, Keane John A, Nenadić Goran

Affiliations

School of Computer Science & Informatics, Cardiff University, Cardiff CF24 3AA, UK.

Clinical Outcomes Unit, The Christie NHS Foundation Trust, Manchester M20 4BX, UK.

Publication details

Int J Med Inform. 2014 Sep;83(9):605-23. doi: 10.1016/j.ijmedinf.2014.06.009. Epub 2014 Jun 24.

DOI:10.1016/j.ijmedinf.2014.06.009

PMID:25008281

Abstract

PURPOSE

This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research.

METHODS

A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar.

RESULTS

A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
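The symbolic techniques named above (dictionary-lookup NER, pattern-matching IE, and evaluation by F-measure) can be illustrated with a minimal sketch. It is not taken from any system covered by the review: the lexicon, the entity labels, the tumour-size pattern and the sample sentence are all hypothetical, chosen only to show the general shape of such methods.

```python
import re

# Hypothetical mini-lexicon (an assumption for illustration). A real system
# would draw its terms from a curated terminology or ontology.
LEXICON = {
    "ductal carcinoma": "DIAGNOSIS",
    "adenocarcinoma": "DIAGNOSIS",
    "left breast": "ANATOMY",
    "lung": "ANATOMY",
}


def dictionary_ner(text):
    """Dictionary-lookup NER: return (start, end, surface, label) per match."""
    hits = []
    for term, label in LEXICON.items():
        # Whole-word, case-insensitive match against the original text.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(0), label))
    return sorted(hits)


# Pattern-matching IE: a hypothetical rule for tumour size mentions
# such as "2.5 cm" or "12mm".
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def extract_tumour_sizes(text):
    """Return a list of (value, unit) pairs found in the text."""
    return [(float(v), u.lower()) for v, u in SIZE_PATTERN.findall(text)]


def f_measure(predicted, gold):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    report = "Invasive ductal carcinoma of the left breast, measuring 2.5 cm."
    print(dictionary_ner(report))
    print(extract_tumour_sizes(report))
```

A lookup of this kind only finds terms spelled as they appear in the lexicon, which is why the non-standard abbreviations and spelling errors mentioned in the abstract limit its recall.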
