Rahman Quazi Abidur, Janmohamed Tahir, Clarke Hance, Ritvo Paul, Heffernan Jane, Katz Joel
Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada.
Centre for Disease Modelling, Department of Mathematics and Statistics, York University, Toronto, ON, Canada.
JMIR Med Inform. 2019 Nov 20;7(4):e15601. doi: 10.2196/15601.
Pain volatility is an important factor in chronic pain experience and adaptation. Previously, we employed machine-learning methods to define and predict pain volatility levels from users of the Manage My Pain app. Reducing the number of features is important to help increase interpretability of such prediction models. Prediction results also need to be consolidated from multiple random subsamples to address the class imbalance issue.
This study aimed to: (1) increase the interpretability of previously developed pain volatility models by identifying the most important features that distinguish high from low volatility users; and (2) consolidate prediction results from models derived from multiple random subsamples while addressing the class imbalance issue.
A total of 132 features were extracted from the first month of app use to develop machine learning-based models for predicting pain volatility at the sixth month of app use. Three feature selection methods were applied to identify features that were significantly better predictors than the other members of the large feature set used for developing the prediction models: (1) Gini impurity criterion; (2) information gain criterion; and (3) Boruta. We then combined the three groups of important features determined by these algorithms to produce the final list of important features. Three machine learning methods were then employed to conduct prediction experiments using the selected important features: (1) logistic regression with ridge estimators; (2) logistic regression with least absolute shrinkage and selection operator; and (3) random forests. Multiple random under-sampling of the majority class was conducted to address class imbalance in the dataset. Subsequently, a majority voting approach was employed to consolidate prediction results from these multiple subsamples. The study included 879 users with a total of 391,255 pain records.
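To make the first two selection criteria concrete, the following is a minimal sketch of how Gini impurity and information gain score a single candidate split. It is an illustration only, not the study's implementation: the feature names (`pain_sd`, `app_days`), the toy values, and the thresholds are hypothetical, and Boruta is not covered.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in this value."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, threshold, impurity):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted

# Hypothetical toy data: one informative feature and one uninformative feature.
pain_sd = [0.4, 0.6, 0.9, 1.1, 1.8, 2.0, 2.3, 2.7]
app_days = [3, 20, 5, 18, 7, 25, 4, 22]
label = ["low", "low", "low", "low", "high", "high", "high", "high"]

for name, feature, thr in [("pain_sd", pain_sd, 1.5), ("app_days", app_days, 10)]:
    print(name,
          "gini decrease:", round(split_gain(feature, label, thr, gini), 3),
          "information gain:", round(split_gain(feature, label, thr, entropy), 3))
```

In this toy example the split on `pain_sd` separates the two classes completely, so both criteria give it the maximum score, while the split on `app_days` leaves both children evenly mixed and scores zero. A tree-based importance ranking aggregates such per-split scores over many splits and trees.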
A threshold of 1.6 was established using clustering methods to differentiate between 2 classes: low volatility (n=694) and high volatility (n=185). The overall prediction accuracy was approximately 70% for both random forests and logistic regression models when using 132 features. Overall, 9 important features were identified using 3 feature selection methods. Of these 9 features, 2 were from the app use category and the other 7 were related to pain statistics. After consolidating models that were developed using random subsamples by majority voting, logistic regression models performed equally well using 132 or 9 features. Random forests performed better than logistic regression methods in predicting the high volatility class. The consolidated accuracy of random forests did not drop significantly (601/879; 68.4% vs 618/879; 70.3%) when only 9 important features were included in the prediction model.
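The reported counts can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the helper name `pct` is introduced here for illustration.

```python
# Class sizes reported for the two volatility classes.
low, high = 694, 185
total = low + high

def pct(count, denom):
    """Percentage rounded to one decimal place."""
    return round(100 * count / denom, 1)

print("users:", total)                               # matches the 879 users in the study
print("high volatility share (%):", pct(high, total))
print("majority-to-minority ratio:", round(low / high, 2))

# Consolidated random forest accuracies quoted in the text.
print("accuracy 601/879 (%):", pct(601, total))
print("accuracy 618/879 (%):", pct(618, total))
```

The two class sizes sum to the 879 users, and the two accuracy fractions reproduce the quoted 68.4% and 70.3%.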
We employed feature selection methods to identify important features in predicting future pain volatility. To address class imbalance, we consolidated models that were developed using multiple random subsamples by majority voting. Reducing the number of features did not result in a significant decrease in the consolidated prediction accuracy.
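The consolidation step can be sketched as follows: draw several balanced subsamples by randomly under-sampling the majority class, fit one model per subsample, and take the majority vote of their predictions. This is a minimal illustration under stated assumptions, not the study's code. The nearest-centroid classifier is a simple stand-in for the logistic regression and random forest models, and the number of subsamples (`n_models=11`), the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter

def undersample(X, y, rng):
    """Keep every minority-class row plus an equal-sized random draw of majority rows."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, lab in enumerate(y) if lab == minority]
    maj_idx = [i for i, lab in enumerate(y) if lab != minority]
    keep = min_idx + rng.sample(maj_idx, len(min_idx))
    return [X[i] for i in keep], [y[i] for i in keep]

def fit_centroids(X, y):
    """Stand-in classifier (assumption): store the per-class mean of each feature."""
    sums, counts = {}, Counter(y)
    for row, lab in zip(X, y):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

def majority_vote_predict(X_train, y_train, X_test, n_models=11, seed=0):
    """Fit one model per balanced subsample and return the most common prediction per row."""
    rng = random.Random(seed)
    models = [fit_centroids(*undersample(X_train, y_train, rng)) for _ in range(n_models)]
    consolidated = []
    for row in X_test:
        votes = Counter(predict(model, row) for model in models)
        consolidated.append(votes.most_common(1)[0][0])
    return consolidated

# Hypothetical, well-separated toy data: 12 "low" rows and 4 "high" rows.
X_low = [[0.3 + 0.05 * i, 1.0 + 0.1 * (i % 3)] for i in range(12)]
X_high = [[2.2 + 0.1 * i, 1.0 + 0.1 * (i % 2)] for i in range(4)]
X_train = X_low + X_high
y_train = ["low"] * 12 + ["high"] * 4

print(majority_vote_predict(X_train, y_train, [[0.5, 1.1], [2.4, 1.1]]))
```

An odd number of models avoids tied votes in a two-class problem. Each subsample sees all minority rows but a different slice of the majority class, so voting across models uses more of the majority data than any single balanced subsample would.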