

A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information.

Author information

Dinsa Etana Fikadu, Das Mrinal, Abebe Teklu Urgessa

Affiliations

Department of Computer Science and Engineering, Engineering and Technology, Wollega University, Oromia, Ethiopia.

Department of Data Science, Indian Institute of Technology Palakkad (IIT Palakkad), Palakkad, India.

Publication information

Sci Rep. 2024 Dec 30;14(1):32051. doi: 10.1038/s41598-024-83743-3.

Abstract

Afaan Oromo is a resource-scarce language with few tools developed for its processing, which poses significant challenges for natural language processing tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and a lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using the latent Dirichlet allocation (LDA) algorithm. None of the collected documents carries label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use LDA, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant topics for each document and generates a weight for each word per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. The model could be applied to classifying medical documents and, from the information obtained, to finding the specialists best suited to patients' requests. In our experiments, topic modeling with LDA achieved 79.17% accuracy and a 79.66% F1 score on the test documents of the dataset.
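To make the pipeline in the abstract concrete, the following is a minimal sketch of LDA fitted to unlabeled, tokenized documents. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which inference method or library was used, so collapsed Gibbs sampling (one standard way to fit LDA) is swapped in here. The function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy English tokens are all hypothetical placeholders; real input would be preprocessed Afaan Oromo text.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on docs (lists of string tokens)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})  # the word dictionary
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                      # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed document-topic (theta) and topic-word (phi) distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


def top_words(vocab, phi, n=3):
    """Return the n highest-weight (word, weight) pairs for each topic."""
    return [
        sorted(zip(vocab, row), key=lambda p: p[1], reverse=True)[:n]
        for row in phi
    ]


# Toy unlabeled corpus with placeholder English tokens (an assumption).
docs = [
    ["fever", "cough", "flu", "fever"],
    ["cough", "flu", "fever", "cough"],
    ["heart", "pressure", "blood", "heart"],
    ["blood", "pressure", "heart", "blood"],
]
vocab, theta, phi = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(vocab, phi)):
    print(k, [w for w, _ in words])
```

In this sketch, `theta[d]` plays the role of the per-document topic weights and `phi[k]` the per-topic word weights described above. Topic indices carry no meaning on their own, so a person would still inspect the top keywords of each topic and assign a class label, as the abstract describes.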


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79c3/11686009/ad9fd1d451d2/41598_2024_83743_Fig1_HTML.jpg
