Convolutional Neural Networks for Biomedical Text Classification: Application in Indexing Biomedical Articles.

Authors

Rios Anthony, Kavuluru Ramakanth

Affiliations

Department of Computer Science, University of Kentucky, Lexington, Kentucky.

Division of Biomedical Informatics, Depts. of Biostatistics and Computer Science, University of Kentucky, Lexington, Kentucky.

Publication information

ACM BCB. 2015 Sep;2015:258-267. doi: 10.1145/2808719.2808746.

DOI:10.1145/2808719.2808746

PMID:28736769

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5521984/

Abstract

Building high-accuracy text classifiers is an important task in biomedicine given the wealth of information hidden in unstructured narratives such as research articles and clinical documents. Due to large feature spaces, traditionally, discriminative approaches such as logistic regression and support vector machines with n-gram and semantic features (e.g., named entities) have been used for text classification where additional performance gains are typically made through feature selection and ensemble approaches. In this paper, we demonstrate that a more direct approach using convolutional neural networks (CNNs) outperforms several traditional approaches in biomedical text classification with the specific use-case of assigning Medical Subject Headings (or MeSH terms) to biomedical articles. Trained annotators at the National Library of Medicine (NLM) assign on average 13 codes to each biomedical article, thus semantically indexing scientific literature to support NLM's PubMed search system. Recent evidence suggests that effective automated efforts for MeSH term assignment start with binary classifiers for each term. In this paper, we use CNNs to build binary text classifiers and achieve an absolute improvement of over 3% in macro F-score over a set of selected hard-to-classify MeSH terms when compared with the best prior results on a public dataset. Additional experiments on 50 high-frequency terms in the dataset also show improvements with CNNs. Our results indicate the strong potential of CNNs in biomedical text classification tasks.
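
The abstract reports its headline result as a macro F-score over per-term binary classifiers. As a minimal sketch of how such a score is commonly computed, the Python below assumes the standard definition (the unweighted mean of each term's F1), which the abstract itself does not spell out. The function names and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one binary classifier, from its confusion counts."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-term F1.

    gold and pred map each term to a list of 0/1 labels, one per article.
    """
    scores = []
    for term, gold_labels in gold.items():
        pairs = list(zip(gold_labels, pred[term]))
        tp = sum(1 for g, p in pairs if g == 1 and p == 1)
        fp = sum(1 for g, p in pairs if g == 0 and p == 1)
        fn = sum(1 for g, p in pairs if g == 1 and p == 0)
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(scores)


# Made-up labels for two terms over four articles (illustration only).
gold = {"Humans": [1, 1, 0, 1], "Mice": [0, 1, 0, 0]}
pred = {"Humans": [1, 0, 0, 1], "Mice": [0, 1, 1, 0]}
print(round(macro_f1(gold, pred), 4))
```

Because every term contributes equally to the average, a rare, hard-to-classify term weighs as much as a frequent one, which is presumably why the metric suits an evaluation focused on hard-to-classify terms.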

Similar articles

Convolutional Neural Networks for Biomedical Text Classification: Application in Indexing Biomedical Articles.

ACM BCB. 2015 Sep;2015:258-267. doi: 10.1145/2808719.2808746.

Automatic Assignment of Non-Leaf MeSH Terms to Biomedical Articles.

AMIA Annu Symp Proc. 2015 Nov 5;2015:697-706. eCollection 2015.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

BMC Bioinformatics. 2015 Apr 30;16:138. doi: 10.1186/s12859-015-0564-6.

Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII.

Database (Oxford). 2023 Mar 7;2023. doi: 10.1093/database/baad005.

Analyzing the Moving Parts of a Large-Scale Multi-Label Text Classification Pipeline: Experiences in Indexing Biomedical Articles.

Proc (IEEE Int Conf Healthc Inform). 2015 Oct;2015:1-7. doi: 10.1109/ICHI.2015.6. Epub 2015 Dec 10.

NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles.

Database (Oxford). 2022 Dec 1;2022. doi: 10.1093/database/baac102.

Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings.

Data Knowl Eng. 2014 Nov;94(B):189-201. doi: 10.1016/j.datak.2014.09.002. Epub 2014 Sep 18.

Unsupervised Medical Subject Heading Assignment Using Output Label Co-occurrence Statistics and Semantic Predications.

Nat Lang Process Inf Syst. 2013 Jun;7934:176-188. doi: 10.1007/978-3-642-38824-8_15.

GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification.

J Biomed Inform. 2021 Apr;116:103699. doi: 10.1016/j.jbi.2021.103699. Epub 2021 Feb 15.

FullMeSH: improving large-scale MeSH indexing with full text.

Bioinformatics. 2020 Mar 1;36(5):1533-1541. doi: 10.1093/bioinformatics/btz756.

Cited by

Using artificial intelligence to develop a measure of orthopaedic treatment success from clinical notes.

Front Digit Health. 2025 Apr 24;7:1523953. doi: 10.3389/fdgth.2025.1523953. eCollection 2025.

Enhancing semantical text understanding with fine-tuned large language models: A case study on Quora Question Pair duplicate identification.

PLoS One. 2025 Jan 10;20(1):e0317042. doi: 10.1371/journal.pone.0317042. eCollection 2025.

Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.

Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.

Enhancing Medical Image Retrieval with UMLS-Integrated CNN-Based Text Indexing.

Diagnostics (Basel). 2024 Jun 6;14(11):1204. doi: 10.3390/diagnostics14111204.

Predicting which patients with cancer will see a psychiatrist or counsellor from their initial oncology consultation document using natural language processing.

Commun Med (Lond). 2024 Apr 8;4(1):69. doi: 10.1038/s43856-024-00495-x.

Early Predicting Tribocorrosion Rate of Dental Implant Titanium Materials Using Random Forest Machine Learning Models.

Tribol Int. 2023 Sep;187. doi: 10.1016/j.triboint.2023.108735. Epub 2023 Jun 26.

A Disease-Prediction Protocol Integrating Triage Priority and BERT-Based Transfer Learning for Intelligent Triage.

Bioengineering (Basel). 2023 Mar 27;10(4):420. doi: 10.3390/bioengineering10040420.

Predicting the Survival of Patients With Cancer From Their Initial Oncology Consultation Document Using Natural Language Processing.

JAMA Netw Open. 2023 Feb 1;6(2):e230813. doi: 10.1001/jamanetworkopen.2023.0813.

Weakly Supervised Ternary Stream Data Augmentation Fine-Grained Classification Network for Identifying Acute Lymphoblastic Leukemia.

Diagnostics (Basel). 2021 Dec 22;12(1):16. doi: 10.3390/diagnostics12010016.

Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

J Biomed Inform. 2022 Jan;125:103957. doi: 10.1016/j.jbi.2021.103957. Epub 2021 Nov 22.

References

Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings.

Data Knowl Eng. 2014 Nov;94(B):189-201. doi: 10.1016/j.datak.2014.09.002. Epub 2014 Sep 18.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

BMC Bioinformatics. 2015 Apr 30;16:138. doi: 10.1186/s12859-015-0564-6.

Feature engineering for MEDLINE citation categorization with MeSH.

BMC Bioinformatics. 2015 Apr 8;16:113. doi: 10.1186/s12859-015-0539-7.

Context-driven automatic subgraph creation for literature-based discovery.

J Biomed Inform. 2015 Apr;54:141-57. doi: 10.1016/j.jbi.2015.01.014. Epub 2015 Feb 7.

Knowledge based word-concept model estimation and refinement for biomedical text mining.

J Biomed Inform. 2015 Feb;53:300-7. doi: 10.1016/j.jbi.2014.11.015. Epub 2014 Dec 12.

Learning regular expressions for clinical text classification.

J Am Med Inform Assoc. 2014 Sep-Oct;21(5):850-7. doi: 10.1136/amiajnl-2013-002411. Epub 2014 Feb 27.

Comparison and combination of several MeSH indexing approaches.

AMIA Annu Symp Proc. 2013 Nov 16;2013:709-18. eCollection 2013.

Recommending MeSH terms for annotating biomedical articles.

J Am Med Inform Assoc. 2011 Sep-Oct;18(5):660-7. doi: 10.1136/amiajnl-2010-000055. Epub 2011 May 25.

An overview of MetaMap: historical perspective and recent advances.

J Am Med Inform Assoc. 2010 May-Jun;17(3):229-36. doi: 10.1136/jamia.2009.002733.

Optimal training sets for Bayesian prediction of MeSH assignment.

J Am Med Inform Assoc. 2008 Jul-Aug;15(4):546-53. doi: 10.1197/jamia.M2431. Epub 2008 Apr 24.
