Liu Jiaxing, Wong Zoie S Y, So H Y, Tsui Kwok Leung
School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China.
School of Data Science, City University of Hong Kong, Kowloon, Hong Kong SAR, China.
J Am Med Inform Assoc. 2021 Jul 30;28(8):1756-1764. doi: 10.1093/jamia/ocab048.
This study aims to improve machine learning classification of fall incident severity levels by addressing class imbalance and incorporating structured features.
We present an incident report classification (IRC) framework to classify the in-hospital fall incident severity level by addressing the imbalanced class problem and incorporating structured attributes. After text preprocessing, bag-of-words features, structured text features, and structured clinical features were extracted from the reports. Next, resampling techniques were incorporated into the training process. Machine learning algorithms were used to build classification models. IRC systems were trained, validated, and tested using a repeated and randomly stratified shuffle-split cross-validation method. Finally, we evaluated the system performance using the F1-measure, precision, and recall over 15 stratified test sets.
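The repeated stratified shuffle-split evaluation described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function name `stratified_shuffle_split`, the 20% test fraction, the class labels `minor` and `severe`, and the use of 15 independent repetitions to obtain the 15 test sets are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_fraction=0.2, n_splits=15, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions.

    Hypothetical helper: each split reshuffles the indices of every class
    independently, then moves a fixed fraction of each class to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for y in sorted(by_class):
            idx = by_class[y][:]          # copy so the source list is untouched
            rng.shuffle(idx)
            n_test = max(1, round(len(idx) * test_fraction))
            test_idx.extend(idx[:n_test])
            train_idx.extend(idx[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Toy imbalanced label vector (assumed data): 90 common vs. 10 rare reports.
labels = ["minor"] * 90 + ["severe"] * 10
splits = list(stratified_shuffle_split(labels))
```

Because sampling is done per class, every test set keeps the 9:1 ratio of the full data, so the rare class is never absent from evaluation. Any resampling would then be applied to the training indices only, leaving the test sets at their original distribution.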
The experimental results demonstrated that the classification system setting that addressed both data imbalance and structured features outperformed the other system settings (mean macro-averaged F1-measure of 0.733). Compared with the baseline system setting, adding structured features and resampling techniques significantly improved the mean F1-measure for the rare class by 30.88% (P value < .001) and the mean macro-averaged F1-measure by 8.26% (P value < .001). In general, the classification system employing the random forest algorithm and random oversampling method outperformed the others.
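To make the reported metric concrete, the sketch below computes per-class precision, recall, and F1, then takes the unweighted mean of the per-class F1 values, which is the standard definition of the macro-averaged F1-measure. The helper names `per_class_prf` and `macro_f1` and the toy labels are hypothetical; the numbers are illustrative and unrelated to the study's results.

```python
def per_class_prf(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the rare class counts as much as the common one."""
    classes = sorted(set(y_true))
    return sum(per_class_prf(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Toy example (assumed data): one of the two rare "severe" reports is missed.
y_true = ["minor"] * 8 + ["severe"] * 2
y_pred = ["minor"] * 8 + ["severe", "minor"]
score = macro_f1(y_true, y_pred)  # (16/17 + 2/3) / 2, about 0.804
```

In this toy case accuracy is 90%, yet the rare-class F1 is only 2/3. Macro-averaging exposes that gap, which is why it suits imbalanced severity data.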
Structured features provide essential information for categorizing the fall incident severity level. Resampling methods help rebalance the class distribution of the original incident report data, which improves the performance of machine learning models. The IRC framework presented in this study effectively automates the classification of fall incident reports by severity level.
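Random oversampling, the resampling method named above, can be illustrated with a short standard-library sketch. It duplicates randomly chosen minority-class samples until every class matches the majority count. The function name `random_oversample`, the balancing target, and the toy data are assumptions for illustration; the study's exact resampling configuration is not given here.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class samples until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())         # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):       # zero iterations for the majority class
            i = rng.choice(pool)          # sample with replacement
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Toy training set (assumed data): 6 common reports and 2 rare ones.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y_train = ["minor"] * 6 + ["severe"] * 2
X_bal, y_bal = random_oversample(X_train, y_train)
balanced_counts = Counter(y_bal)
```

Oversampling is meant for the training portion of each split only. Duplicating samples before splitting would leak copies of the same report into the test set and inflate the scores.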