Strale Frederick, Riddle Isaac, Geng Bowen, Oxford Blake, Kah Malia, Sherwin Robert
Biostatistics, The Oxford Center, Brighton, USA.
Information Technology, The Oxford Center, Brighton, USA.
Cureus. 2025 Apr 10;17(4):e82005. doi: 10.7759/cureus.82005. eCollection 2025 Apr.
Background This research compared the simple and advanced statistical results of SPSS (IBM Corp., Armonk, NY, USA) with those of ChatGPT-4 and ChatGPT o3-mini (OpenAI, San Francisco, CA, USA) in statistical data output and interpretation, using behavioral healthcare data. It evaluated their methodological approaches, quantitative performance, interpretability, adaptability, ethical considerations, and future trends.
Methods Fourteen statistical analyses were conducted on two real datasets that underpinned peer-reviewed scientific articles published in 2024. Descriptive statistics, Pearson r, multiple correlation with Pearson r, Spearman's rho, simple linear regression, one-sample t-test, paired t-test, two-independent sample t-test, multiple linear regression, one-way analysis of variance (ANOVA), repeated measures ANOVA, two-way (factorial) ANOVA, and multivariate ANOVA (MANOVA) were computed. The two datasets were collected over systematically structured timeframes, March 19, 2023, through June 11, 2023, and June 7, 2023, through July 7, 2023, thereby ensuring the integrity and temporal representativeness of the data gathering. The analyses were run by entering text commands into ChatGPT-4 and ChatGPT o3-mini together with the relevant variables, which were copied and pasted from the SPSS datasets.
Results The study found high concordance between SPSS and ChatGPT-4 in fundamental statistical analyses, such as measures of central tendency, variability, and simple Pearson and Spearman correlation analyses, where the results were nearly identical. ChatGPT-4 also closely matched SPSS in the three t-tests and simple linear regression, with minimal variation in effect sizes. Discrepancies emerged in complex analyses. ChatGPT o3-mini showed inflated correlation values and significant results where none were expected, indicating computational deviations. It also produced inflated coefficients in the multiple correlation and inflated R-squared values in two-way ANOVA and multiple regression, suggesting that it applied different assumptions. ChatGPT-4 and ChatGPT o3-mini produced identical F-statistics in repeated measures ANOVA but reported incorrect degrees of freedom (df). While ChatGPT-4 performed well in one-way ANOVA, it miscalculated degrees of freedom in MANOVA, leading to significant discrepancies. ChatGPT o3-mini also generated erroneous F-statistics in factorial ANOVA, highlighting the need for further optimization in multivariate statistical modeling.
Conclusions This study underscored the rapid advancements in artificial intelligence (AI)-driven statistical analyses while highlighting areas that require further refinement. ChatGPT-4 accurately executed fundamental statistical tests, closely matching SPSS. However, its reliability diminished in more advanced statistical procedures, which require further validation. ChatGPT o3-mini, while optimized for Science, Technology, Engineering, and Mathematics (STEM) applications, produced inconsistencies in correlation and multivariate analyses, limiting its dependability for complex research applications. As AI evolves, ensuring that these tools align with established statistical methodologies will be essential for their widespread adoption in scientific research.
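The degrees-of-freedom errors reported above are the kind of discrepancy a reader can check by hand, because the expected df depend only on the study design. The following is a minimal sketch, not the authors' procedure: it encodes the textbook df formulas for a balanced one-way ANOVA and for a one-factor repeated measures ANOVA with sphericity assumed (no Greenhouse-Geisser or Huynh-Feldt correction). The function names, the `DegreesOfFreedom` type, and the sample sizes in the usage lines are hypothetical illustrations and do not come from the study's datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegreesOfFreedom:
    """Numerator (effect) and denominator (error) df of an F-test."""
    effect: int
    error: int


def one_way_anova_df(n_total: int, n_groups: int) -> DegreesOfFreedom:
    """Between-groups and within-groups df for a one-way ANOVA.

    df_between = k - 1, df_within = N - k.
    """
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least 2 groups and more observations than groups")
    return DegreesOfFreedom(effect=n_groups - 1, error=n_total - n_groups)


def repeated_measures_anova_df(n_subjects: int, n_conditions: int) -> DegreesOfFreedom:
    """Effect and error df for a one-factor repeated measures ANOVA.

    Assumes sphericity (uncorrected df):
    df_effect = k - 1, df_error = (n - 1) * (k - 1).
    """
    if n_subjects < 2 or n_conditions < 2:
        raise ValueError("need at least 2 subjects and 2 conditions")
    return DegreesOfFreedom(
        effect=n_conditions - 1,
        error=(n_subjects - 1) * (n_conditions - 1),
    )


def df_matches(reported: DegreesOfFreedom, expected: DegreesOfFreedom) -> bool:
    """True when a tool's reported df agree with the design-based df."""
    return reported == expected


if __name__ == "__main__":
    # Hypothetical design: 30 participants measured under 3 conditions.
    expected = repeated_measures_anova_df(n_subjects=30, n_conditions=3)
    # Hypothetical tool output that treats the design as between-subjects.
    reported = DegreesOfFreedom(effect=2, error=87)
    print(expected, df_matches(reported, expected))
```

In this hypothetical case the reported error df (87) is what a between-subjects one-way ANOVA on 90 observations would give, which is one plausible way an F-statistic could be accompanied by the wrong df. Whether that is what happened in the study cannot be determined from the abstract.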