Xiang Richard F
Department of Pathology and Laboratory Medicine, Dalhousie University, Halifax, Nova Scotia, Canada.
J Pathol Inform. 2024 Jan 4;15:100358. doi: 10.1016/j.jpi.2023.100358. eCollection 2024 Dec.
Natural language processing (NLP) has been used to extract information from medical reports and to summarize them. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. One approach to creating these large datasets is to use classical NLP algorithms with low resource requirements. In this manuscript, we examined how well an automated classical NLP algorithm was able to classify portions of bone marrow report text into their appropriate sections. A total of 1480 bone marrow reports were extracted from the laboratory information system of a tertiary healthcare network. The free text of these bone marrow reports was preprocessed by separating the reports into text blocks and then removing the section headers. An NLP algorithm involving n-grams and K-means clustering was used to classify the text blocks into their appropriate bone marrow report sections. We assessed the impact of token replacement (of numerical values, accession numbers, and clusters of differentiation), of varying the number of centroids (1-19) and the n-gram size (1-5), and of utilizing an ensemble algorithm. The optimal NLP model employed an ensemble algorithm that incorporated token replacement, used 1-grams (bag of words), and used 10 centroids for K-means clustering. This optimal model classified text blocks with an accuracy of 89%, suggesting that classical NLP models can accurately classify portions of marrow report text.
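The pipeline described in the abstract (token replacement, n-gram counting, K-means clustering) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption, not the authors' implementation: the regular expressions, the placeholder tokens (`<cd>`, `<accession>`, `<num>`), the accession number format, the function names, and the deterministic centroid initialization are all hypothetical. The abstract does not say how clusters were mapped to report section labels or how the ensemble was built, so neither step is shown.

```python
import re
from collections import Counter

# Hypothetical patterns: the abstract does not give the actual replacement rules.
CD_MARKER = re.compile(r"\bCD\d+[a-z]?\b", re.IGNORECASE)  # e.g. CD34, CD117
ACCESSION = re.compile(r"\b[A-Z]{1,3}\d{2}-\d+\b")         # assumed format, e.g. BM23-1042
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")                  # integers and decimals


def replace_tokens(text):
    """Replace CD markers, accession numbers, and numbers with placeholder tokens."""
    text = CD_MARKER.sub("<cd>", text)
    text = ACCESSION.sub("<accession>", text)
    text = NUMBER.sub("<num>", text)
    return text.lower()


def ngram_counts(text, n=1):
    """Count word n-grams after token replacement (n=1 is a bag of words)."""
    tokens = re.findall(r"<\w+>|[a-z]+", replace_tokens(text))
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain K-means. Centroids start at the first k vectors (an assumption
    made so the sketch is reproducible; random initialization is more usual)."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def cluster_blocks(blocks, k, n=1):
    """Vectorize text blocks as n-gram counts and return a cluster index per block."""
    counts = [ngram_counts(block, n) for block in blocks]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return kmeans(vectors, k)


# Invented example text blocks (headers already removed); not taken from the study data.
example_blocks = [
    "Blasts 12% of nucleated cells, CD34 positive, CD117 positive",
    "Cellularity 40% with trilineage hematopoiesis and no fibrosis",
    "Blasts 3% of nucleated cells, CD34 negative",
    "Cellularity 90% with reduced hematopoiesis and mild fibrosis",
]
print(cluster_blocks(example_blocks, k=2))  # prints [0, 1, 0, 1]
```

In this toy input the two aspirate-like blocks share one cluster and the two biopsy-like blocks share the other, in part because token replacement makes "12%" and "3%" (and "CD34" and "CD117") identical features. To turn clusters into section predictions, each centroid would still need a section label, for example by majority vote over labeled blocks, which is one plausible choice rather than the method reported here.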