Chen Liang-Ching, Chang Kuei-Hu
Department of Foreign Languages, R.O.C. Military Academy, Kaohsiung, Taiwan.
Institute of Education, National Sun Yat-sen University, Kaohsiung, Taiwan.
Int J Intell Syst. 2021 Jul;36(7):3190-3216. doi: 10.1002/int.22413. Epub 2021 Mar 11.
A corpus is a massive body of structured textual data that is stored and processed electronically. It is usually combined with statistics, machine learning algorithms, or artificial intelligence (AI) technologies to explore the semantic relationships between lexical units, and it is beneficial when applied to language learning, information processing, translation, and so forth. In the face of a novel disease such as COVID-19, establishing a medical-specific corpus will enhance frontline medical personnel's efficiency in acquiring information, guiding them toward the right approaches to respond to and prevent the disease. To retrieve critical messages from the corpus effectively, handling word-ranking issues appropriately is crucial. However, traditional frequency-based approaches may introduce bias into word ranking because they neither optimize the corpus nor jointly take the dispersion and concentration of word frequencies into consideration. Thus, this paper develops a novel corpus-based approach that combines corpus software with the Hirsch index (H-index) algorithm to handle both issues simultaneously, making word-ranking processes more accurate. This paper compiled 100 COVID-19-related research articles as an empirical example of the target corpus. To verify the proposed approach, this study compared its results with those of two traditional frequency-based approaches. The results indicate that the proposed approach can refine the corpus and simultaneously account for the dispersion and concentration of word frequencies when ranking words.
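The abstract does not spell out how the H-index is applied to words, so the following is only a minimal sketch under an assumed reading: a word's H-index is the largest h such that the word occurs at least h times in each of at least h documents. Under that reading, a high score requires the word to be both frequent within documents (concentration) and spread across documents (dispersion). The function names `h_index` and `rank_words_by_h_index`, the whitespace tokenizer, and the tie-breaking by total frequency are illustrative assumptions, not the authors' implementation, and the corpus-refinement step performed by the corpus software is omitted.

```python
from collections import Counter


def h_index(counts):
    """Return the largest h such that at least h of the counts are >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def rank_words_by_h_index(documents):
    """Rank words by H-index over per-document frequencies (assumed reading).

    Ties are broken by total frequency, then alphabetically; this
    tie-breaking rule is an assumption made for the sketch.
    """
    per_word_counts = {}
    for doc in documents:
        # Naive lowercase whitespace tokenization, for illustration only.
        for word, count in Counter(doc.lower().split()).items():
            per_word_counts.setdefault(word, []).append(count)
    scores = {w: h_index(c) for w, c in per_word_counts.items()}
    totals = {w: sum(c) for w, c in per_word_counts.items()}
    return sorted(scores, key=lambda w: (-scores[w], -totals[w], w))


# Toy corpus: "virus" and "mask" both occur five times in total, but
# "mask" is confined to a single document.
docs = [
    "virus virus virus spread",
    "virus virus spread",
    "mask mask mask mask mask",
]
ranking = rank_words_by_h_index(docs)
```

In this toy corpus a raw frequency count cannot separate "virus" from "mask", whereas the H-index places "virus" first because its occurrences are spread over two documents.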