A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information.

Authors

Dinsa Etana Fikadu, Das Mrinal, Abebe Teklu Urgessa

Affiliations

Department of Computer Science and Engineering, Engineering and Technology, Wollega University, Oromia, Ethiopia.

Department of Data Science, Indian Institute of Technology Palakkad (IIT Palakkad), Palakkad, India.

Publication information

Sci Rep. 2024 Dec 30;14(1):32051. doi: 10.1038/s41598-024-83743-3.

DOI:10.1038/s41598-024-83743-3
PMID:39738682
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11686009/
Abstract

Afaan Oromo is a resource-scarce language with few tools developed for processing it, which poses significant challenges for natural language tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and the lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using the latent Dirichlet allocation (LDA) algorithm. None of the collected documents carry label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use the LDA model, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant document topics and generates a weight value for each word in the documents per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. The model could be applied to classifying medical documents and, from the information obtained, to finding the specialists best suited to patients' requests. In our experiments, topic modeling with LDA achieved a promising 79.17% accuracy and a 79.66% F1 score on the test documents of the dataset.
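
The pipeline described in the abstract (a word dictionary goes in, per-topic word weights and per-document topic proportions come out) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a small collapsed Gibbs sampler for LDA in plain Python, and the names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, the iteration count, and the English placeholder tokens are all assumptions chosen for illustration.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (vocab, phi, theta): phi[k][v] is the weight of word v in
    topic k, theta[d][k] is the share of topic k in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})  # the word dictionary
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # uses of word v in topic k
    topic_total = [0] * num_topics                             # tokens assigned to topic k
    assign = []                                                # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the topic-word and document-topic distributions.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


def top_words(vocab, phi, topic, n=3):
    """Return the n highest-weighted (word, weight) pairs for one topic."""
    ranked = sorted(zip(vocab, phi[topic]), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical, already-tokenized toy corpus (English placeholders, not real data).
docs = [
    ["fever", "cough", "flu", "fever"],
    ["cough", "flu", "fever", "cough"],
    ["heart", "blood", "pressure", "heart"],
    ["blood", "pressure", "heart", "blood"],
]
vocab, phi, theta = fit_lda(docs, num_topics=2)
for k in range(2):
    print(k, top_words(vocab, phi, k))
```

In a workflow like the one the abstract outlines, a human reviewer would read the top keywords of each topic, judge their coherence, and attach a class label; each document could then be assigned the label of its highest-weight topic in `theta`. How the paper maps topics to labels in detail is not stated in the abstract, so that last step is an assumption.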


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/ad9fd1d451d2/41598_2024_83743_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/b7eae3f55269/41598_2024_83743_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/012911b5b484/41598_2024_83743_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/689df018601a/41598_2024_83743_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/a51bb89e8bfd/41598_2024_83743_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/f10f5fa4c2c7/41598_2024_83743_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/0a3fbc83b2b2/41598_2024_83743_Fig7_HTML.jpg

