Comparison of an Ensemble of Machine Learning Models and the BERT Language Model for Analysis of Text Descriptions of Brain CT Reports to Determine the Presence of Intracranial Hemorrhage.
Author information
Junior Researcher, Department of Innovative Technologies; Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Department of Health, Bldg 1, 24 Petrovka St., Moscow, 127051, Russia.
Junior Researcher, Department of Medical Informatics, Radiomics and Radiogenomics; Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Department of Health, Bldg 1, 24 Petrovka St., Moscow, 127051, Russia.
Publication information
Sovrem Tekhnologii Med. 2024;16(1):27-34. doi: 10.17691/stm2024.16.1.03. Epub 2024 Feb 28.
AIM
The aim of the study was to train and test an ensemble of machine learning models and to compare its performance with that of a BERT language model pre-trained on medical data on a simple binary classification task: determining the presence or absence of signs of intracranial hemorrhage (ICH) in brain CT reports.
MATERIALS AND METHODS
Seven machine learning algorithms and three text vectorization techniques were selected to solve the binary classification problem. The models were trained on 3980 brain CT reports from 56 inpatient medical facilities in Moscow. The reports were vectorized using bag of words, TF-IDF, and word2vec, and the resulting data were processed by the following machine learning algorithms: decision tree, random forest, logistic regression, nearest neighbors, support vector machines, CatBoost, and XGBoost. Data analysis and pre-processing were performed using NLTK (Natural Language Toolkit, version 3.6.5), a suite of libraries for symbolic and statistical natural language processing, and Scikit-learn (version 0.24.2), a machine learning library containing tools for classification tasks. MedRuBertTiny2, a BERT transformer model pre-trained on medical data, was used for comparison.
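As an illustration of the vectorize-then-classify setup described above, the following is a minimal sketch and not the authors' code. It assumes a recent Scikit-learn, uses invented English toy reports in place of the Russian-language study data, and shows only two of the three vectorization techniques (bag of words and TF-IDF) with one of the seven algorithms (logistic regression). The names `reports`, `labels`, and `build_model` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reports, invented for illustration; 1 = signs of ICH present.
reports = [
    "acute subdural hematoma in the left hemisphere",
    "intracerebral hemorrhage in the right basal ganglia",
    "subarachnoid hemorrhage along the convexity",
    "no signs of hemorrhage, ventricles not dilated",
    "no acute intracranial pathology detected",
    "midline structures not displaced, no hematoma",
]
labels = [1, 1, 1, 0, 0, 0]


def build_model(vectorizer):
    """Chain a text vectorizer with a logistic regression classifier."""
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))


# Bag of words: each report becomes a vector of raw token counts.
bow_model = build_model(CountVectorizer()).fit(reports, labels)

# TF-IDF: the same counts, reweighted so that terms occurring in many reports count less.
tfidf_model = build_model(TfidfVectorizer()).fit(reports, labels)

vocabulary_size = len(bow_model.named_steps["countvectorizer"].vocabulary_)
bow_predictions = bow_model.predict(reports)
tfidf_predictions = tfidf_model.predict(reports)
```

Both vectorizers ignore word order, so a negated finding such as "no hematoma" shares the token "hematoma" with a positive report; the classifier has to learn the negation from co-occurring tokens.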
RESULTS
Based on the training and testing outcomes of the seven machine learning algorithms, the authors selected the three that yielded the highest metrics (i.e., sensitivity and specificity): CatBoost, logistic regression, and nearest neighbors. Among the vectorization techniques, bag of words yielded the highest metrics. The three algorithms were combined into an ensemble using the stacking technique. On the validation dataset separated from the original sample, the sensitivity and specificity were 0.93 and 0.90, respectively. Next, the ensemble and the BERT model were trained on an independent dataset of 9393 textual radiology reports, likewise divided into training and test sets. When tested on this dataset, the ensemble achieved a sensitivity of 0.92 and a specificity of 0.90, while the BERT model demonstrated a sensitivity of 0.97 and a specificity of 0.90.
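The stacking step and the two reported metrics can be sketched as follows. This is a hedged illustration under several assumptions, not the authors' implementation: Scikit-learn's `GradientBoostingClassifier` stands in for CatBoost so that the example has no extra dependency, the meta-learner (logistic regression) and `cv=3` are arbitrary choices because the abstract does not specify them, and the reports are invented toy data. The metrics are computed on the training reports only to show the calculation, so they say nothing about real performance.

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = ICH present."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # share of ICH reports that were flagged
    specificity = tn / (tn + fp)  # share of non-ICH reports that were cleared
    return sensitivity, specificity


# Hypothetical toy reports, invented for illustration; 1 = signs of ICH present.
reports = [
    "acute subdural hematoma in the left hemisphere",
    "intracerebral hemorrhage in the right basal ganglia",
    "subarachnoid hemorrhage along the convexity",
    "epidural hematoma with mass effect",
    "intraventricular hemorrhage, blood in the lateral ventricles",
    "hemorrhagic contusion in the frontal lobe",
    "no signs of hemorrhage, ventricles not dilated",
    "no acute intracranial pathology detected",
    "midline structures not displaced, no hematoma",
    "age-related atrophic changes, no blood detected",
    "chronic ischemic changes, no hemorrhage",
    "normal brain CT, no focal lesions",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Three base learners mirroring the selected algorithm families.
base_learners = [
    ("boosting", GradientBoostingClassifier(random_state=0)),  # stand-in for CatBoost
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
]

# Stacking: out-of-fold predictions of the base learners become the
# input features of the final estimator (the meta-learner).
ensemble = make_pipeline(
    CountVectorizer(),  # bag of words
    StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(),  # assumed meta-learner
        cv=3,
    ),
)
ensemble.fit(reports, labels)

predictions = ensemble.predict(reports)
sensitivity, specificity = sensitivity_specificity(labels, predictions)
```

In a real evaluation the metrics would be computed on a held-out test set, as in the study, rather than on the reports used for fitting.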
CONCLUSION
When analyzing textual reports of brain CT scans for signs of intracranial hemorrhage, the trained ensemble demonstrated high accuracy metrics. Still, manual quality control of its results is required in practice. The pre-trained BERT transformer model, additionally trained on diagnostic textual reports, demonstrated higher accuracy metrics (p<0.05). The results are promising for finding specific values, both in binary classification tasks and in in-depth analysis of unstructured medical information.