Li Jiajia, Wang Zikai, Yu Longxuan, Liu Hui, Song Haitao
Shanghai Artificial Intelligence Research Institute Co., Ltd, Shanghai, China.
Xiangfu Laboratory, Jiaxing, China.
JMIR Form Res. 2025 Mar 19;9:e54803. doi: 10.2196/54803.
Medical abstract sentence classification is crucial for enhancing medical database searches, literature reviews, and generating new abstracts. However, Chinese medical abstract classification research is hindered by a lack of suitable datasets. Given the vastness of Chinese medical literature and the unique value of traditional Chinese medicine, precise classification of these abstracts is vital for advancing global medical research.
This study aims to address the data scarcity issue by generating a large volume of labeled Chinese abstract sentences without manual annotation, thereby creating new training datasets. Additionally, we seek to develop more accurate text classification algorithms to improve the precision of Chinese medical abstract classification.
We developed 3 training datasets (dataset #1, dataset #2, and dataset #3) and a test dataset to evaluate our model. Dataset #1 contains 15,000 abstract sentences translated from the PubMed dataset into Chinese. Datasets #2 and #3, each with 15,000 sentences, were generated using GPT-3.5 from 40,000 Chinese medical abstracts in the CSL database. Dataset #2 used titles and keywords for pseudolabeling, while dataset #3 aligned abstracts with category labels. The test dataset includes 87,000 sentences from 20,000 abstracts. We used SBERT embeddings for deeper semantic analysis and evaluated our model using clustering (SBERT-DocSCAN) and supervised methods (SBERT-MEC). Extensive ablation studies and feature analyses were conducted to validate the model's effectiveness and robustness.
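The pseudolabeling strategy behind dataset #2 (labeling sentences from title and keyword cues, then training a classifier over sentence embeddings) can be illustrated with a minimal, stdlib-only sketch. Everything here is hypothetical scaffolding: the `KEYWORDS` lists, the bag-of-words `embed` function (a toy stand-in for the SBERT embeddings the paper actually uses), and the nearest-centroid classifier (a simplification of the SBERT-MEC and SBERT-DocSCAN models) are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

# Toy stand-in for SBERT: a bag-of-words vector. The paper uses dense
# SBERT sentence embeddings; this keeps the sketch dependency-free.
def embed(sentence):
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical keyword lists, mirroring dataset #2's idea of deriving
# pseudolabels from titles/keywords rather than manual annotation.
KEYWORDS = {
    "methods": ["randomized", "cohort", "measured"],
    "results": ["significant", "increased", "decreased"],
}

def pseudolabel(sentence):
    """Assign a category by counting keyword hits; None if no cue fires."""
    text = sentence.lower()
    scores = {lab: sum(w in text for w in kws) for lab, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def train_centroids(sentences):
    """Build one embedding centroid per pseudolabeled class."""
    buckets = {}
    for s in sentences:
        lab = pseudolabel(s)
        if lab is not None:
            buckets.setdefault(lab, Counter()).update(embed(s))
    return buckets

def classify(centroids, sentence):
    """Label a new sentence by its nearest class centroid."""
    v = embed(sentence)
    return max(centroids, key=lambda lab: cosine(centroids[lab], v))

corpus = [
    "Patients were randomized into two cohort groups",
    "Blood pressure was measured weekly",
    "The treatment group showed significant improvement",
    "Mortality increased in the control arm",
]
centroids = train_centroids(corpus)
print(classify(centroids, "Levels were measured at baseline"))
```

The same two-stage shape — cheap automatic labels first, a stronger embedding-based classifier second — is what lets the approach scale to tens of thousands of sentences without manual annotation.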
Our experiments involved training both clustering and supervised models on the 3 datasets, followed by comprehensive evaluation on the test dataset. The outcomes demonstrated that our models outperformed the baseline metrics. Specifically, when trained on dataset #1, the SBERT-DocSCAN model achieved an accuracy and F1-score of 89.85% on the test dataset, while the SBERT-MEC algorithm exhibited comparable performance with an accuracy and F1-score of 89.38%. Training on dataset #2 yielded similarly positive results: the SBERT-DocSCAN model achieved an accuracy and F1-score of 89.83%, while the SBERT-MEC algorithm recorded an accuracy of 86.73% and an F1-score of 86.51%. Notably, training with dataset #3 allowed the SBERT-DocSCAN model to attain the best performance, with an accuracy and F1-score of 91.30%, whereas the SBERT-MEC algorithm also performed robustly, obtaining an accuracy of 90.39% and an F1-score of 90.35%. Ablation analysis highlighted the critical role of the integrated features and methodologies in improving classification performance.
Our approach addresses the challenge of limited datasets for Chinese medical abstract classification by generating novel datasets. The deployment of SBERT-DocSCAN and SBERT-MEC models significantly enhances the precision of classifying Chinese medical abstracts, even when using synthetic datasets with pseudolabels.