D'Alberti Elena, Patey Olga, Smith Carolyn, Šalović Bojana, Hernandez-Cruz Netzahualcoyotl, Noble J Alison, Papageorghiou Aris T
Nuffield Department of Women's & Reproductive Health, University of Oxford, Oxford, United Kingdom.
Department of Maternal and Child Health and Urological Sciences, Sapienza University of Rome, Rome, Italy.
EClinicalMedicine. 2025 May 30;84:103250. doi: 10.1016/j.eclinm.2025.103250. eCollection 2025 Jun.
Advances in artificial intelligence (AI) have triggered interest in using intelligent systems to improve prenatal detection of fetal congenital heart defects (CHDs). Our aim is to systematically examine the current literature on diagnostic performance of AI-enabled prenatal cardiac ultrasound.
This systematic review and meta-analysis was registered with PROSPERO (CRD42024549601). Embase, Medline, Cochrane Central Database of Controlled Trials, and CINAHL were searched from inception until February 2025. Studies evaluating AI performance in prenatal detection of fetal CHDs were eligible for inclusion; studies focusing on the application of AI before 16 weeks of gestation, or using three- or four-dimensional ultrasound, were excluded. Pooled sensitivity and specificity were obtained using a random-effects model, and pooled proportions using the Freeman-Tukey arcsine square root transformation. Heterogeneity was assessed with the I² statistic. Risk of bias and adherence to reporting standards were assessed using QUADAS-2 and TRIPOD+AI, respectively. Risk of publication bias was assessed with Deeks' test, and certainty of evidence for outcomes with the GRADE approach.
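As a rough illustration of the pooling steps named above, the following Python sketch applies a Freeman-Tukey double arcsine transformation to per-study proportions, pools them with a random-effects model, and reports I². Several points are assumptions rather than details taken from the review: the abstract does not name the between-study variance estimator, so DerSimonian-Laird is used here; the back-transformation is the simple sin²(t/2) approximation; the function names are illustrative; and the study counts are hypothetical.

```python
import math


def freeman_tukey(events, n):
    """Freeman-Tukey double arcsine transform of the proportion events/n.

    Returns the transformed value and its approximate variance 1/(n + 0.5).
    """
    t = (math.asin(math.sqrt(events / (n + 1)))
         + math.asin(math.sqrt((events + 1) / (n + 1))))
    return t, 1.0 / (n + 0.5)


def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    Returns (pooled effect, tau^2, I^2 in percent).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    # Fixed-effect estimate, needed for Cochran's Q
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    # I^2: share of total variation attributed to heterogeneity
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, tau2, i2


def back_transform(t):
    """Approximate inverse of the double arcsine transform."""
    return math.sin(t / 2) ** 2


# Hypothetical (detected CHD cases, total CHD cases) per study; NOT review data.
studies = [(45, 50), (80, 100), (27, 30), (150, 160)]
transformed = [freeman_tukey(x, n) for x, n in studies]
effects = [t for t, _ in transformed]
variances = [v for _, v in transformed]

pooled_t, tau2, i2 = pool_random_effects(effects, variances)
pooled_sensitivity = back_transform(pooled_t)
print(f"pooled sensitivity ~ {pooled_sensitivity:.3f}, I2 = {i2:.1f}%")
```

Diagnostic accuracy meta-analyses often pool sensitivity and specificity jointly (for example with a bivariate model) rather than one proportion at a time, so this univariate sketch should be read as an explanation of the transformation and of I², not as a reproduction of the review's analysis.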
Fifteen studies were included, of which fourteen developed and evaluated a model and one externally evaluated a previously trained model. Images and videos obtained during cardiac screening or fetal echocardiography of 30,121 fetuses were used for training, validation and testing. For the binary task of classifying the heart as normal or abnormal, AI models achieved a pooled sensitivity of 0.89 (95% CI 0.83-0.93, I² = 77.92%) and specificity of 0.91 (95% CI 0.84-0.95, I² = 77.92%). The subgroup analysis showed that models tested on various CHDs exhibited lower sensitivity than those tested for a specific cardiac abnormality (0.85; 95% CI 0.75-0.91 vs 0.92; 95% CI 0.87-0.96), while specificity remained comparable (0.90; 95% CI 0.79-0.96 vs 0.91; 95% CI 0.81-0.97). Overall, AI models performed better than operators with lower expertise and were nearly comparable to experts; however, the human comparator group (median six clinicians, IQR 3-10) was usually small and non-blinded. Relevant sources of heterogeneity were the types of cardiac views collected, the prevalence of CHDs across different datasets, and the types of CHDs examined. The risk of bias was moderate to high, and adherence to reporting standards was low (adherence >70% in only 18 of 51 TRIPOD+AI items). The risk of publication bias was not statistically significant (Deeks' test p = 0.474).
These findings suggest that AI models perform better than clinicians with lower expertise, but this must be interpreted with caution due to the high risk of bias and sources of heterogeneity.
This study was partly supported by the InnoHK-funded Hong Kong Centre for Cerebro-cardiovascular Health Engineering (COCHE) Project 2.1 (Cardiovascular risks in early life and fetal echocardiography). ATP and JAN are supported by the National Institute for Health and Care Research (NIHR) Oxford Biomedical Research Centre (BRC).