Semantic categorization of Chinese eligibility criteria in clinical trials using machine learning methods.

Author information

Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China.

Philips Research China, Shanghai, 200072, China.

Publication information

BMC Med Inform Decis Mak. 2021 Apr 15;21(1):128. doi: 10.1186/s12911-021-01487-w.

Abstract

BACKGROUND

Semantic categorization of clinical trial eligibility criteria using natural language processing is crucial for optimizing clinical trial design and building automated patient recruitment systems. However, most related research has focused on English eligibility criteria, and to the best of our knowledge, no study has examined Chinese eligibility criteria. In this study, we therefore aimed to explore the semantic categories of Chinese eligibility criteria.

METHODS

We downloaded clinical trial registration files from the website of the Chinese Clinical Trial Registry (ChiCTR) and extracted both the Chinese eligibility criteria and the corresponding English eligibility criteria. We represented the criteria sentences using Unified Medical Language System (UMLS) semantic types and applied hierarchical clustering to induce semantic categories. Furthermore, to evaluate how well Chinese eligibility criteria can be classified into the resulting semantic categories, we implemented multiple classification algorithms, including four baseline machine learning algorithms (LR, NB, kNN, SVM), three deep learning algorithms (CNN, RNN, FastText), and two pre-trained language models (BERT, ERNIE).
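
The clustering step described above can be illustrated with a minimal sketch. It assumes each criterion sentence has already been annotated with UMLS semantic type names and encodes it as a binary vector; the toy sentences, their type assignments, the Jaccard distance, the average linkage, and the 0.6 cut-off are all assumptions made for illustration, since the abstract does not specify these choices.

```python
from itertools import combinations

# Hypothetical toy annotations: each criterion sentence is mapped to a set of
# UMLS semantic type names. The assignments are assumptions for illustration.
SENTENCES = {
    "Age 18 to 75 years": {"Age Group", "Quantitative Concept"},
    "Aged over 60": {"Age Group", "Quantitative Concept"},
    "Diagnosed with hepatocellular carcinoma": {"Neoplastic Process", "Diagnostic Procedure"},
    "History of liver cancer": {"Neoplastic Process"},
}


def to_vector(types, vocabulary):
    """Binary vector marking which semantic types occur in a sentence."""
    return [1 if t in types else 0 for t in vocabulary]


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the active positions of two vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    present = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - shared / present if present else 0.0


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, threshold):
    """Merge the closest pair of clusters until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        if average_linkage(clusters[a], clusters[b], vectors) > threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


vocabulary = sorted(set().union(*SENTENCES.values()))
vectors = [to_vector(types, vocabulary) for types in SENTENCES.values()]
clusters = agglomerative(vectors, threshold=0.6)
names = list(SENTENCES)
for cluster in clusters:
    print([names[i] for i in cluster])
```

With these toy inputs the two age-related sentences fall into one cluster and the two cancer-related sentences into another; in practice each resulting cluster would still need manual review before being named as a semantic category.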

RESULTS

In total, we developed 44 semantic categories, summarized them into 8 topic groups, and investigated their average incidence and prevalence in 272 hepatocellular carcinoma-related Chinese clinical trials. Compared with previously proposed categories for English eligibility criteria, 13 novel categories were identified in Chinese eligibility criteria. The classification results show that most semantic categories were classified well; the pre-trained language model ERNIE achieved the best performance, with a macro-average F1 score of 0.7980 and a micro-average F1 score of 0.8484.
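
The gap between the macro-average and micro-average F1 scores reported above reflects how the two averages weight categories. The sketch below shows the difference on hypothetical labels (the category names and predictions are invented for illustration and are not data from the study): macro-averaging gives every category equal weight, so rare, poorly classified categories pull it down, while micro-averaging pools all decisions and is dominated by frequent categories.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-average F1, micro-average F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical gold and predicted categories for six criterion sentences.
y_true = ["Age", "Age", "Age", "Diagnostic", "Diagnostic", "Consent"]
y_pred = ["Age", "Age", "Age", "Diagnostic", "Age", "Diagnostic"]

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the rare "Consent" category is never predicted correctly, so the macro average falls well below the micro average, the same direction of gap as in the reported ERNIE scores.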

CONCLUSION

As a pilot study of Chinese eligibility criteria analysis, we developed 44 semantic categories using hierarchical clustering for the first time and validated their classification capacity with multiple classification algorithms.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92f4/8050926/4ba79bc37655/12911_2021_1487_Fig1_HTML.jpg
