Semantic classification of Indonesian consumer health questions.

Authors

Hanami Raniah Nur, Mahendra Rahmad, Wicaksono Alfan Farizki

Affiliation

Faculty of Computer Science, Universitas Indonesia, Kampus UI, Depok, 16424, West Java, Indonesia.

Publication

J Biomed Semantics. 2025 Jul 28;16(1):13. doi: 10.1186/s13326-025-00334-5.

DOI:10.1186/s13326-025-00334-5
PMID:40721829
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12302743/
Abstract

PURPOSE

Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient's intent and route them towards the relevant information.

METHODS

This paper proposes a novel two-step approach to address the challenge of semantic type classification in Indonesian consumer health questions. We acknowledge the scarcity of Indonesian health domain data, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model's predictions, we employ an explainable model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model's decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of "semantic bias", where words with no inherent association with a specific semantic type disproportionately influence the model's predictions.
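The idea of asking which words drive a prediction can be illustrated with a much simpler stand-in for LIME: remove one word at a time and measure how far the predicted probability of a class drops. The sketch below is not LIME itself and not the authors' pipeline. The classifier is a small multinomial Naive Bayes written from scratch, and the training questions, the label names other than DIAGNOSIS, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: made-up Indonesian questions with made-up labels.
TRAIN = [
    ("apa penyebab sakit kepala", "CAUSE"),
    ("apa penyebab demam tinggi", "CAUSE"),
    ("bagaimana cara mengobati demam", "TREATMENT"),
    ("obat apa untuk sakit kepala", "TREATMENT"),
    ("apakah ini gejala kanker", "DIAGNOSIS"),
    ("apakah saya terkena depresi", "DIAGNOSIS"),
]


def train_nb(examples):
    """Fit a multinomial Naive Bayes model (counts only; smoothing is applied at prediction time)."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return a dict mapping each class to its posterior probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for w in text.split():
            if w in vocab:  # out-of-vocabulary words are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def word_attributions(model, text, target):
    """Drop in P(target) when each word is removed in turn (assumes unique words)."""
    words = text.split()
    base = predict_proba(model, text)[target]
    scores = {}
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores[w] = base - predict_proba(model, reduced)[target]
    return scores


model = train_nb(TRAIN)
question = "apakah ini gejala kanker"
probs = predict_proba(model, question)
attr = word_attributions(model, question, "DIAGNOSIS")
for word, delta in sorted(attr.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {delta:+.3f}")
```

In this toy model a disease name such as "kanker" receives a positive score for DIAGNOSIS only because it happens to occur in a DIAGNOSIS training question, which is the kind of effect the bias analysis is meant to expose. LIME differs in that it fits a weighted local linear model over many random perturbations rather than single-word deletions.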

RESULTS

The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type. This suggests a potential bias in the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words "kanker" (cancer) and "depresi" (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.
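For reference, the weighted average F1 score mentioned above is the per-class F1 averaged with each class weighted by its number of true instances (its support). The sketch below computes it in plain Python. The function name and the example labels are illustrative, and it is assumed, not confirmed by the abstract, that the authors used this standard definition (the one behind scikit-learn's `f1_score(average="weighted")`).

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 is taken as 0 when undefined
        score += f1 * support[label] / total
    return score


# Hypothetical imbalanced example: four DIAGNOSIS questions, two TREATMENT.
y_true = ["DIAGNOSIS"] * 4 + ["TREATMENT"] * 2
y_pred = ["DIAGNOSIS", "DIAGNOSIS", "DIAGNOSIS", "TREATMENT",
          "TREATMENT", "DIAGNOSIS"]
result = weighted_f1(y_true, y_pred)
print(round(result, 4))
```

Here the per-class F1 is 0.75 for DIAGNOSIS and 0.5 for TREATMENT, so the weighted average is 0.75 × 4/6 + 0.5 × 2/6 = 2/3. Because the majority class dominates the weighting, this metric can look healthy even when minority classes are predicted poorly, which is one reason balancing methods such as Borderline SMOTE are worth evaluating alongside it.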

CONCLUSION

We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance. These words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in model performance between the two scenarios, suggesting that the models might possess an ability to mitigate the influence of such bias during the learning process.
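The with/without comparison described above can be sketched as building two versions of the same dataset, one untouched and one with the biased words stripped, and then training and scoring a model on each. The snippet below shows only the dataset-building step. The word list contains just the two examples named in the abstract, and the function names and sample questions are hypothetical; the authors' actual word list and preprocessing are not given here.

```python
# Only the two words the abstract names; the full list used in the paper is unknown.
BIASED_WORDS = {"kanker", "depresi"}


def remove_biased_words(question, biased=BIASED_WORDS):
    """Drop every whitespace-delimited token that matches a biased word."""
    return " ".join(w for w in question.split() if w.lower() not in biased)


def build_scenarios(dataset, biased=BIASED_WORDS):
    """Return (with_bias, without_bias) copies of a list of (question, label) pairs."""
    with_bias = list(dataset)
    without_bias = [(remove_biased_words(q, biased), label) for q, label in dataset]
    return with_bias, without_bias


# Hypothetical sample questions and labels.
dataset = [
    ("apakah ini gejala kanker", "DIAGNOSIS"),
    ("bagaimana cara mengobati depresi", "TREATMENT"),
]
with_bias, without_bias = build_scenarios(dataset)
print(without_bias)
```

Training the same model on each version and comparing the resulting scores gives a direct, if coarse, estimate of how much the flagged words contribute to performance.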

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00ab/12302743/5e6a848df479/13326_2025_334_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00ab/12302743/04e73af2589d/13326_2025_334_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00ab/12302743/89ae87c6fd41/13326_2025_334_Fig2_HTML.jpg