Investigating response behavior through TF-IDF and Word2vec text analysis: A case study of PISA 2012 problem-solving process data.

Authors

Zhou Jing, Ye Zhanliang, Zhang Sheng, Geng Zhao, Han Ning, Yang Tao

Affiliation

Collaborative Innovation Center of Assessment Towards Basic Education Quality, Beijing Normal University, No. 19, XinJieKouWai St., HaiDian District, Beijing, 100875, PR China.

Publication information

Heliyon. 2024 Aug 10;10(16):e35945. doi: 10.1016/j.heliyon.2024.e35945. eCollection 2024 Aug 30.

DOI:10.1016/j.heliyon.2024.e35945
PMID:39247276
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379602/
Abstract

The process data in computer-based problem-solving evaluation is rich in valuable implicit information. However, its diverse and irregular structure poses challenges for effective feature extraction, leading to varying degrees of information loss in existing methods. Process-response behavior exhibits similarities to textual data in terms of the key units and contextual relationships. Despite the scarcity of relevant research, exploring text analysis methods for feature recognition in process data is significant. This study investigated the efficacy of Term Frequency-Inverse Document Frequency (TF-IDF) and Word to Vector (Word2vec) in extracting response behavior features and compared the predictive, analytical, and clustering effects of classical machine learning methods (supervised and unsupervised) on response behavior. An analysis of the PISA 2012 computer-based problem-solving dataset revealed that TF-IDF effectively extracted key response behaviors, whereas Word2vec captured effective features from sequenced response behaviors. In addition, in supervised machine learning using both methods, the random forest model based on TF-IDF performed the best, followed by the SVM model based on Word2vec. Word2vec-based models outperformed TF-IDF-based ones in the F1-score, accuracy, and recall (except for precision) across the logistic regression, k-nearest neighbor, and support vector machine algorithms. In unsupervised machine learning, the k-means algorithm effectively clustered different response behavior patterns extracted by these methods. The findings underscore the theoretical and methodological transferability of these text analysis methods in educational and psychological assessment contexts. This study offers valuable insights for research and practice in similar domains by yielding rich feature representations, supplementing fine-grained assessment evidence, fostering personalized learning, and introducing novel insights for educational assessment.
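The idea of treating logged response behavior as text can be illustrated with a minimal sketch. It assumes each student's action sequence is one "document" and each logged action is one "term". The action names, the student identifiers, and the particular weighting (tf as relative frequency, idf as ln(N / df)) are illustrative assumptions, not the PISA 2012 log codes or the exact TF-IDF variant used in the study.

```python
import math
from collections import Counter

# Hypothetical action sequences: one "document" per student.
# Token names are made up for illustration; they are not PISA log codes.
sequences = {
    "student_a": ["start", "click_map", "click_map", "submit"],
    "student_b": ["start", "reset", "click_map", "submit"],
    "student_c": ["start", "submit"],
}

def tf_idf(docs):
    """Return {doc_id: {action: weight}}.

    tf  = count of the action / length of the sequence
    idf = ln(number of sequences / number of sequences containing the action)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for actions in docs.values():
        doc_freq.update(set(actions))  # count each action once per sequence
    weights = {}
    for doc_id, actions in docs.items():
        counts = Counter(actions)
        weights[doc_id] = {
            action: (count / len(actions)) * math.log(n_docs / doc_freq[action])
            for action, count in counts.items()
        }
    return weights

weights = tf_idf(sequences)
for doc_id, w in weights.items():
    top_action = max(w, key=w.get)
    print(doc_id, top_action, round(w[top_action], 3))
```

Under this weighting, actions that every student performs (here `start` and `submit`) receive a weight of zero, while rarer actions such as `reset` stand out, which is the sense in which TF-IDF highlights key response behaviors. It ignores the order of actions, which is the gap the abstract attributes to Word2vec.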

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/32adf4a71de9/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/0d5d63fc66a2/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/88f15fabc8ee/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/54a8f85a1eb6/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/8a32c8a14da3/gr5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/83a50830ebeb/gr6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/e39d7139ea32/gr7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/56d023d02b1a/gr8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/02184a9ed5ed/gr9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/a172643e0987/gr10.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/302cd122d875/gr11.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e92/11379602/fc32806f189f/gr12.jpg
