Gilboa D, Garg Akhil, Shapiro M, Meseguer M, Amar Y, Lustgarten N, Desai N, Shavit T, Silva V, Papatheodorou A, Chatziparasidou A, Angras S, Lee J H, Thiel L, Curchoe C L, Tauber Y, Seidman D S
AIVF Ltd, Tel Aviv, Israel.
IVIRMA Valencia, Valencia, Spain.
Reprod Biol Endocrinol. 2025 Jan 31;23(1):16. doi: 10.1186/s12958-025-01351-w.
Artificial intelligence (AI) models analyzing embryo time-lapse images have been developed to predict the likelihood of pregnancy following in vitro fertilization (IVF). However, limited research exists on methods ensuring AI consistency and reliability in clinical settings during its development and validation process. We present a methodology for developing and validating an AI model across multiple datasets to demonstrate reliable performance in evaluating blastocyst-stage embryos.
This multicenter analysis utilizes time-lapse images, pregnancy outcomes, and morphologic annotations from embryos collected at 10 IVF clinics across 9 countries between 2018 and 2022. The four-step methodology for developing and evaluating the AI model includes: (I) curating annotated datasets that represent the intended clinical use case; (II) developing and optimizing the AI model; (III) evaluating the AI's performance by assessing its discriminative power and associations with pregnancy probability across variable data; and (IV) ensuring interpretability and explainability by correlating AI scores with relevant morphologic features of embryo quality. Three datasets were used: the training and validation dataset (n = 16,935 embryos), the blind test dataset (n = 1,708 embryos; 3 clinics), and the independent dataset (n = 7,445 embryos; 7 clinics) derived from previously unseen clinic cohorts.
The AI was designed as a deep learning classifier ranking embryos by score according to their likelihood of clinical pregnancy. Higher AI score brackets were associated with increased fetal heartbeat (FH) likelihood across all evaluated datasets, showing a trend of increasing odds ratios (OR). The highest OR was observed in the top G4 bracket (test dataset G4 score ≥ 7.5: OR 3.84; independent dataset G4 score ≥ 7.5: OR 4.01), while the lowest was in the G1 bracket (test dataset G1 score < 4.0: OR 0.40; independent dataset G1 score < 4.0: OR 0.45). AI score brackets G2, G3, and G4 displayed OR values above 1.0 (P < 0.05), indicating linear associations with FH likelihood. Average AI scores were consistently higher for FH-positive than for FH-negative embryos within each age subgroup. Positive correlations were also observed between AI scores and key morphologic parameters used to predict embryo quality.
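As an illustration of how an odds ratio for a score bracket might be computed, the following is a minimal sketch and not the authors' analysis code. It assumes each bracket is compared against all embryos outside that bracket (the abstract does not state the reference group), uses a Woolf log-scale 95% interval, and runs on synthetic toy data. Only the G1 (< 4.0) and G4 (≥ 7.5) thresholds come from the text; the function name `odds_ratio` and its parameters are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio(scores, fh, in_bracket, z=1.96):
    """Odds ratio of fetal heartbeat (FH) inside a score bracket vs. all others.

    Assumption: the comparison group is every embryo outside the bracket.
    Returns (OR, (ci_low, ci_high)) with a Woolf (log-scale) interval.
    """
    # 2x2 table: a = in bracket, FH+; b = in bracket, FH-;
    #            c = outside,    FH+; d = outside,    FH-
    a = b = c = d = 0
    for score, outcome in zip(scores, fh):
        if in_bracket(score):
            if outcome:
                a += 1
            else:
                b += 1
        elif outcome:
            c += 1
        else:
            d += 1
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction so an empty cell does not divide by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    estimate = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, (exp(log(estimate) - z * se), exp(log(estimate) + z * se))


# Synthetic toy data for illustration only, NOT study data.
scores = [2.0, 3.5, 3.9, 5.0, 6.0, 7.0, 7.5, 8.0, 9.0, 9.5]
fh = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]  # 1 = fetal heartbeat observed

# Bracket thresholds taken from the abstract: G4 is score >= 7.5, G1 is score < 4.0.
or_g4, ci_g4 = odds_ratio(scores, fh, lambda s: s >= 7.5)
or_g1, ci_g1 = odds_ratio(scores, fh, lambda s: s < 4.0)
print(f"G4: OR={or_g4:.2f}, 95% CI=({ci_g4[0]:.2f}, {ci_g4[1]:.2f})")
print(f"G1: OR={or_g1:.2f}, 95% CI=({ci_g1[0]:.2f}, {ci_g1[1]:.2f})")
```

With so few toy observations the intervals are very wide; the point is only the shape of the calculation, in which an OR above 1.0 marks a bracket with higher FH odds than the comparison group and an OR below 1.0 marks lower odds.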
Strong AI performance across multiple datasets demonstrates the value of our four-step methodology in developing and validating the AI as a reliable adjunct to embryo evaluation.