Categorization of sentence types in medical abstracts.

Authors

McKnight Larry, Srinivasan Padmini

Affiliation

Department of Medical Informatics, Columbia University, New York, NY, USA.

Publication details

AMIA Annu Symp Proc. 2003;2003:440-4.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1479904/

Abstract

This study evaluated the use of machine learning techniques for classifying sentence types. A total of 7253 structured abstracts and 204 unstructured abstracts of randomized controlled trials (RCTs) from MEDLINE were parsed into sentences, and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature for relative sentence location improved performance markedly for some models and increased the overall average ROC area to 0.95. Linear classifier performance was significantly worse than the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCTs. Identification of sentence types may be helpful for providing context to end users or to other text summarization techniques.
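
The two ingredients named in the abstract, a bag-of-words representation extended with a relative sentence location feature and evaluation by ROC area, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sentence_features` and `roc_area`, the feature key `__rel_pos__`, the tokenization rule, and the toy abstract are all assumptions made for illustration, and no SVM or linear classifier is trained here.

```python
import re
from collections import Counter


def sentence_features(sentences):
    """Build one feature dict per sentence of a single abstract.

    Each dict holds bag-of-words counts (word order is ignored) plus a
    hypothetical '__rel_pos__' feature: 0.0 for the first sentence and
    1.0 for the last.
    """
    n = len(sentences)
    rows = []
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        features = dict(Counter(tokens))
        features["__rel_pos__"] = i / (n - 1) if n > 1 else 0.0
        rows.append(features)
    return rows


def roc_area(scores, labels):
    """ROC area as the chance that a random positive example outscores
    a random negative one, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy abstract (invented for illustration, not taken from the study data).
abstract = [
    "Hypertension is common in older adults.",
    "We randomized 200 patients to drug or placebo.",
    "Blood pressure fell by 10 mmHg in the drug group.",
    "The drug appears effective.",
]
rows = sentence_features(abstract)

# Stand-in scorer for "is this a Conclusion sentence?": relative position only.
scores = [row["__rel_pos__"] for row in rows]
is_conclusion = [0, 0, 0, 1]
print(roc_area(scores, is_conclusion))
```

In a fuller pipeline the feature dicts would be vectorized and passed to a classifier, with one ROC area computed per sentence type and then averaged; the stand-in scorer above only shows why sentence position alone can already carry signal for some types.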
