Experimental Research Center for Medical and Psychological Science (ERC-MPS), School of Psychology, Third Military Medical University, Chongqing, China.
Faculty of Psychology, Southwest University, Chongqing, China.
BMC Med. 2023 Jul 3;21(1):241. doi: 10.1186/s12916-023-02941-4.
The development of machine learning models to aid in the diagnosis of mental disorders is recognized as a significant breakthrough in the field of psychiatry. However, translating such models into clinical practice remains a challenge, with poor generalizability being a major limitation.
Here, we conducted a pre-registered meta-research assessment of neuroimaging-based models in the psychiatric literature, quantitatively examining global and regional sampling issues over recent decades from a perspective that has been relatively underexplored. A total of 476 studies (n = 118,137) were included in the assessment. Based on these findings, we built a comprehensive 5-star rating system to quantitatively evaluate the quality of existing machine learning models for psychiatric diagnosis.
A global sampling inequality in these models was revealed quantitatively (sampling Gini coefficient (G) = 0.81, p < .01), varying across countries and regions (e.g., China, G = 0.47; the USA, G = 0.58; Germany, G = 0.78; the UK, G = 0.87). Furthermore, the severity of this sampling inequality was significantly predicted by national economic level (β = - 2.75, p < .001, R = 0.40; r = - .84, 95% CI: - .97 to - .41), and it plausibly predicted model performance, with higher sampling inequality associated with higher reported classification accuracy. Further analyses showed that lack of independent testing (84.24% of models, 95% CI: 81.0-87.5%), improper cross-validation (51.68% of models, 95% CI: 47.2-56.2%), and poor technical transparency (87.8% of models, 95% CI: 84.9-90.8%) and availability (80.88% of models, 95% CI: 77.3-84.4%) remain prevalent in current diagnostic classifiers, despite improvements over time. Consistent with these observations, model performance was found to decrease in studies with independent cross-country sampling validation (all p < .001, BF > 15). In light of this, we proposed a purpose-built quantitative assessment checklist, which demonstrated that the overall ratings of these models increased with publication year but were negatively associated with model performance.
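The sampling Gini coefficient reported above summarizes how unevenly study samples are distributed across countries or regions, with 0 indicating perfectly even sampling and 1 indicating complete concentration in a single unit. The following is a minimal Python sketch, not the authors' released code, of how such a coefficient can be computed from per-country sample sizes; the function name sampling_gini and the example counts are illustrative assumptions.

```python
# Minimal sketch of a sampling Gini coefficient over per-country
# (or per-region) sample sizes. Example counts are made up; real
# values would come from the 476 included studies.
import numpy as np

def sampling_gini(sample_sizes):
    """Gini coefficient of a distribution of sample sizes.

    Returns 0 for perfectly even sampling across units and
    approaches 1 when all samples concentrate in one unit.
    """
    x = np.sort(np.asarray(sample_sizes, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        raise ValueError("need at least one non-zero sample size")
    # Standard rank-based formula for sorted (ascending) values:
    # G = (2 * sum_i i * x_i) / (n * sum_i x_i) - (n + 1) / n
    ranks = np.arange(1, n + 1)
    return (2.0 * np.dot(ranks, x)) / (n * x.sum()) - (n + 1.0) / n

# Hypothetical per-country sample sizes for illustration.
print(round(sampling_gini([1200, 300, 150, 80, 40, 10]), 2))
```

With equal sample sizes the function returns 0, which is a quick sanity check on the formula.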
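Likewise, the 5-star rating can be pictured as a simple checklist score. The sketch below is purely hypothetical: it reuses the criteria named in this abstract (independent testing, proper cross-validation, technical transparency, availability, and sampling quality) with equal one-star weights, which need not match the published checklist's actual items or weighting.

```python
# Hypothetical illustration of 5-star scoring mechanics; the paper's
# actual checklist items and weights may differ.
from dataclasses import dataclass

@dataclass
class ModelReport:
    independent_test: bool         # evaluated on held-out/external data
    proper_cross_validation: bool  # e.g., nested CV without leakage
    technical_transparency: bool   # pipeline fully described
    artifacts_available: bool      # code/models shared
    adequate_sampling: bool        # e.g., multi-site, balanced sampling

def star_rating(report: ModelReport) -> int:
    """One star per satisfied criterion (illustrative weighting only)."""
    return sum([
        report.independent_test,
        report.proper_cross_validation,
        report.technical_transparency,
        report.artifacts_available,
        report.adequate_sampling,
    ])

print(star_rating(ModelReport(True, False, True, False, True)))  # -> 3
```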
Together, improving economic equality in sampling, and hence the quality of machine learning models, may be a crucial facet of plausibly translating neuroimaging-based diagnostic classifiers into clinical practice.