Hung Wei-Chieh, Lin Yih-Lon, Lin Chi-Wei, Chin Wei-Leng, Wu Chih-Hsing
Department of Family and Community Medicine, E-Da Hospital, I-Shou University, Kaohsiung 82445, Taiwan.
School of Medicine, I-Shou University, Kaohsiung 84001, Taiwan.
Diagnostics (Basel). 2024 Jan 8;14(2):137. doi: 10.3390/diagnostics14020137.
This study aimed to establish advanced sampling methods for free-text data so that semantic text mining models based on deep learning can be built efficiently, for example to identify vertebral compression fracture (VCF) in radiology reports. We included 27,401 free-text radiology reports of spinal X-ray examinations. We compared the predictive performance of text mining models built with supervised long short-term memory networks, each trained on samples drawn independently by one of four sampling methods (vector sum minimization, vector sum maximization, stratified sampling, and simple random sampling) at four fixed sampling ratios. For each combination of sampling method and ratio, the drawn samples formed the training set and the remaining reports were used for validation. Performance in identifying VCF was measured using the area under the receiver operating characteristic curve (AUROC). At sampling ratios of 1/10, 1/20, 1/30, and 1/40, vector sum minimization yielded the highest AUROC: 0.981 (95% CI: 0.980-0.983), 0.963 (95% CI: 0.961-0.965), 0.907 (95% CI: 0.904-0.911), and 0.895 (95% CI: 0.891-0.899), respectively. Vector sum maximization yielded the lowest AUROC. This study proposes vector sum minimization as an advanced sampling method for free-text data that allows text mining models to be built efficiently from a small number of critical, representative samples.
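The evaluation protocol described in the abstract (draw a fixed fraction of reports for training, validate on the remainder, score with AUROC) can be illustrated with a minimal Python sketch. All function names here are hypothetical. The abstract does not define how vector sum minimization selects samples, so `greedy_vector_sum_min` is only an illustrative guess (greedily keeping the norm of the running sum of report vectors small) and should not be read as the authors' algorithm. The AUROC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
import random
from collections import defaultdict


def simple_random_split(n_items, ratio, seed=0):
    """Draw round(n_items * ratio) indices for training; the rest validate."""
    rng = random.Random(seed)
    n_train = max(1, round(n_items * ratio))
    train = set(rng.sample(range(n_items), n_train))
    valid = [i for i in range(n_items) if i not in train]
    return sorted(train), valid


def stratified_split(labels, ratio, seed=0):
    """Draw the same fraction from each label group (e.g. VCF vs. no VCF)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train = set()
    for idxs in by_label.values():
        n_train = max(1, round(len(idxs) * ratio))
        train.update(rng.sample(idxs, n_train))
    valid = [i for i in range(len(labels)) if i not in train]
    return sorted(train), valid


def greedy_vector_sum_min(vectors, n_train):
    """ASSUMPTION, not the paper's method: at each step pick the vector that
    keeps the squared norm of the running sum as small as possible."""
    total = [0.0] * len(vectors[0])
    remaining = set(range(len(vectors)))
    chosen = []
    for _ in range(n_train):
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = min(
            sorted(remaining),
            key=lambda i: sum((t + v) ** 2 for t, v in zip(total, vectors[i])),
        )
        chosen.append(best)
        remaining.remove(best)
        total = [t + v for t, v in zip(total, vectors[best])]
    return chosen


def auroc(labels, scores):
    """Probability that a positive case outranks a negative one (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_labels = [1] * 10 + [0] * 30  # toy data, not from the study
    train_idx, valid_idx = stratified_split(toy_labels, ratio=1 / 10)
    print(len(train_idx), len(valid_idx))
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this sketch the sampling ratio plays the role of the 1/10 to 1/40 fractions in the abstract; the drawn indices would feed the training of a classifier whose validation scores are then passed to `auroc`.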