Department of Computer Science, Aalto University School of Science, Espoo, Finland.
BMC Med Res Methodol. 2022 Mar 6;22(1):60. doi: 10.1186/s12874-021-01502-8.
Developing machine learning models to support health analytics requires a better understanding of the statistical properties of the self-rated expression statements used in health-related communication and decision making. To address this, our current research analyzes self-rated expression statements concerning the coronavirus COVID-19 epidemic and, with a new methodology, identifies how statistically significant differences between groups of respondents can be linked to machine learning results.
We conducted a quantitative cross-sectional study that gathered "need for help" ratings for twenty health-related expression statements concerning the coronavirus epidemic on an 11-point Likert scale, together with nine answers about the respondent's health and wellbeing, sex and age. Online respondents (n = 673) were recruited between 30 May and 3 August 2020 from Finnish patient and disabled people's organizations, other health-related organizations and professionals, and educational institutions. We propose and experimentally motivate a new methodology of influence analysis for machine learning, which evaluates how machine learning results depend on and are influenced by properties of the data that are identified with traditional statistical methods.
We found statistically significant Kendall rank correlations and high cosine similarity values between the "need for help" ratings of various pairs of health-related expression statements, and for one pair of background questions. Using Wilcoxon rank-sum tests, Kruskal-Wallis tests and one-way analysis of variance (ANOVA) between groups, we identified statistically significant rating differences for several health-related expression statements with respect to groupings based on the answer values of the background questions. For example, the ratings for suspecting a coronavirus infection and for having one differed depending on the estimated health condition, quality of life and sex. Our new methodology enabled us to identify how statistically significant rating differences were linked to machine learning results, thus helping to develop better human-understandable machine learning models.
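As an illustration of the two pairwise measures named above, the following is a minimal sketch of how a Kendall rank correlation and a cosine similarity could be computed for the ratings of two statements. It is not the study's code. The function names, the choice of the tie-corrected tau-b variant, the 0 to 10 coding of the 11-point scale and the example ratings are all assumptions made for illustration, and the sketch does not compute p-values.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall rank correlation, tau-b variant (corrects for tied ratings).

    Assumption: tau-b is used here because Likert ratings contain many ties;
    the source does not state which variant was applied.
    """
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # pair tied on both statements: excluded from all counts
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical ratings from eight respondents for two statements,
# coded 0..10 (an assumed coding of the 11-point scale, not study data).
statement_a = [0, 2, 3, 5, 5, 7, 9, 10]
statement_b = [1, 1, 4, 4, 6, 8, 8, 10]

if __name__ == "__main__":
    print("Kendall tau-b:", round(kendall_tau_b(statement_a, statement_b), 3))
    print("Cosine similarity:", round(cosine_similarity(statement_a, statement_b), 3))
```

The two measures capture different things: Kendall's tau depends only on the ordering of the ratings, whereas cosine similarity also depends on their magnitudes, so two statements that are both rated high by everyone can have a high cosine similarity even when their rank correlation is weak.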
The self-rated "need for help" concerning health-related expression statements differs statistically significantly depending on the person's background information, such as their estimated health condition, quality of life and sex. With our new methodology, statistically significant rating differences can be linked to machine learning results, which makes it possible to develop better machine learning for identifying, interpreting and addressing the patient's needs for well-personalized care.