Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study.

Author information

Xie Wenxiu, Ji Christine, Hao Tianyong, Chow Chi-Yin

Affiliations

Department of Computer Science, City University of Hong Kong, Hong Kong, Hong Kong.

School of Languages and Cultures, University of Sydney, Sydney, Australia.

Publication information

JMIR Med Inform. 2021 Oct 26;9(10):e25110. doi: 10.2196/25110.

Abstract

BACKGROUND

There is an increasing body of research on the development of machine learning algorithms for evaluating online health educational resources for specific readerships. Machine learning algorithms are known for their lack of interpretability compared with conventional statistical methods. Given their high predictive precision, improving the interpretability of these algorithms can help increase their applicability and replicability in health educational research and applied linguistics, as well as in the development and review of new health education resources for effective and accessible health education.

OBJECTIVE

Our study aimed to develop a linguistically enriched machine learning model to predict binary outcomes of online English health educational resources in terms of their easiness and complexity for international tertiary students.

METHODS

Logistic regression emerged as the best performing algorithm compared with support vector machine (SVM) (linear), SVM (radial basis function), random forest, and extreme gradient boosting on the data set transformed using L2 normalization. We applied recursive feature elimination with SVM to perform automatic feature selection. The automatically selected features (n=67) were then further streamlined through expert review. The finalized feature set of 22 semantic features achieved area under the curve, sensitivity, specificity, and accuracy similar to those of the initial (n=115) and automatically selected (n=67) feature sets. Logistic regression with the linguistically enhanced feature set (n=22) exhibited notable stability and robustness across training sets of different sizes (20%, 40%, 60%, and 80%), and showed consistently high performance when compared with the other 4 algorithms (SVM [linear], SVM [radial basis function], random forest, and extreme gradient boosting).
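As a rough illustration of two ingredients named above, the sketch below shows L2 normalization of a feature vector and the four reported performance measures (area under the curve, sensitivity, specificity, and accuracy) for a binary easy/hard classifier. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made only for this example.

```python
import math

def l2_normalize(row):
    """Scale one feature vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return list(row) if norm == 0 else [x / norm for x in row]

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for labels coded 1 (easy) and 0 (hard).

    Assumes both classes occur in y_true; otherwise a division by zero is raised.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

def auc(y_true, scores):
    """Area under the ROC curve: share of (positive, negative) pairs ranked correctly, ties counted as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration only (not taken from the study).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
unit_row = l2_normalize([3.0, 4.0])
```

Comparing feature sets of different sizes (n=115, n=67, n=22) then amounts to computing these same four measures for each fitted model on held-out data.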

RESULTS

We identified semantic features (with positive regression coefficients) contributing to the prediction of easy-to-understand online health texts and semantic features (with negative regression coefficients) contributing to the prediction of hard-to-understand health materials for readers with nonnative English backgrounds. Language complexity was explained by lexical difficulty (rarity and medical terminology), verbs typical of medical discourse, and syntactic complexity. Language easiness of online health materials was associated with features such as common speech act verbs, personal pronouns, and familiar reasoning verbs. Successive permutation of features illustrated the interaction between these features and their impact on key performance indicators of the machine learning algorithms.
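The role of coefficient signs in a logistic regression model can be sketched as follows. The feature names echo categories mentioned above, but the coefficient values, the intercept, and the feature values are hypothetical and chosen purely for illustration; they are not results reported in the study.

```python
import math

# Hypothetical coefficients for illustration only; NOT values reported in the study.
COEFFICIENTS = {
    "personal_pronouns": 1.2,     # positive: pushes the prediction toward "easy"
    "speech_act_verbs": 0.8,      # positive: pushes the prediction toward "easy"
    "medical_terminology": -1.5,  # negative: pushes the prediction toward "hard"
}
INTERCEPT = 0.0  # assumed

def probability_easy(features):
    """Logistic model: P(easy) = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented (already normalized) feature values for three texts.
baseline = {"personal_pronouns": 0.5, "speech_act_verbs": 0.5, "medical_terminology": 0.5}
more_pronouns = dict(baseline, personal_pronouns=1.0)
more_terms = dict(baseline, medical_terminology=1.0)

p_base = probability_easy(baseline)
p_pronouns = probability_easy(more_pronouns)
p_terms = probability_easy(more_terms)
```

Under these assumed weights, raising a feature with a positive coefficient increases the predicted probability of "easy", while raising one with a negative coefficient lowers it, which is what makes the fitted coefficients directly readable.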

CONCLUSIONS

The new logistic regression model exhibited consistency, scalability, and, more importantly, interpretability based on existing health and linguistic research. Low and high linguistic accessibility of online health materials were explained by 2 distinct sets of semantic features. This revealed the inherent complexity of effective health communication beyond current readability analyses, which are limited to syntactic complexity and lexical difficulty.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e95/8579219/6ce73e6bdbde/medinform_v9i10e25110_fig1.jpg
