Embryonics, Embryonics R&D Center, Haifa, Israel.
Computer Science, Technion-Israel Institute of Technology, Haifa, Israel.
Hum Reprod. 2022 Sep 30;37(10):2275-2290. doi: 10.1093/humrep/deac171.
STUDY QUESTION: What is the accuracy and agreement of embryologists when assessing the implantation probability of blastocysts using time-lapse imaging (TLI), and can it be improved with a data-driven algorithm?
SUMMARY ANSWER: The overall interobserver agreement of a large panel of embryologists was moderate and prediction accuracy was modest, while the purpose-built artificial intelligence model generally resulted in higher performance metrics.
WHAT IS KNOWN ALREADY: Previous studies have demonstrated significant interobserver variability amongst embryologists when assessing embryo quality. However, data concerning embryologists' ability to predict implantation probability using TLI are still lacking. Emerging technologies based on data-driven tools have shown great promise for improving embryo selection and predicting clinical outcomes.
STUDY DESIGN, SIZE, DURATION: TLI video files of 136 embryos with known implantation data were retrospectively collected from two clinical sites between 2018 and 2019 for the performance assessment of 36 embryologists and comparison with a deep neural network (DNN).
PARTICIPANTS/MATERIALS, SETTING, METHODS: We recruited 39 embryologists from 13 different countries. All participants were blinded to clinical outcomes. A total of 136 TLI videos of embryos that reached the blastocyst stage were used for this experiment. Each embryo's likelihood of successfully implanting was assessed by 36 embryologists, providing implantation probability grades (IPGs) from 1 to 5, where 1 indicates a very low likelihood of implantation and 5 indicates a very high likelihood. Subsequently, three embryologists with over 5 years of experience provided Gardner scores. All 136 blastocysts were categorized into three quality groups based on their Gardner scores. Embryologist predictions were then converted into predictions of implantation (IPG ≥ 3) and no implantation (IPG ≤ 2). Embryologists' performance and agreement were assessed using Fleiss kappa coefficient. A 10-fold cross-validation DNN was developed to provide IPGs for TLI video files. The model's performance was compared to that of the embryologists.
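The conversion of IPGs into binary predictions and the Fleiss kappa agreement measure described above can be sketched as follows. This is a minimal illustration only, not the authors' analysis code: the function names `binarize_ipg` and `fleiss_kappa` and the toy grades are assumptions introduced here, and the study's actual software is not stated in the abstract.

```python
from collections import Counter


def binarize_ipg(grade):
    """Map an implantation probability grade (1-5) to a binary prediction.

    Follows the rule stated in the abstract: IPG >= 3 predicts
    implantation (1), IPG <= 2 predicts no implantation (0).
    """
    if not 1 <= grade <= 5:
        raise ValueError("IPG must be between 1 and 5")
    return 1 if grade >= 3 else 0


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of subjects, each rated by the same number of raters.

    `ratings` is a list of rows; each row holds one categorical label per rater.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: mean over subjects of the share of agreeing rater pairs.
    per_subject = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_subjects * n_raters) for cat in categories
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        return 1.0  # degenerate case: every rating falls in one category
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical grades: 4 embryos, each graded by 3 embryologists.
toy_grades = [[5, 4, 3], [1, 2, 2], [3, 2, 4], [1, 1, 3]]
toy_binary = [[binarize_ipg(g) for g in row] for row in toy_grades]
toy_kappa = fleiss_kappa(toy_binary)
```

In the study itself each of the 136 embryos would contribute one row with 36 ratings; whether kappa was computed on the raw 1 to 5 grades or on the binarized predictions is not specified in the abstract.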
MAIN RESULTS AND THE ROLE OF CHANCE: Logistic regression was employed for the following confounding variables: country of residence, academic level, embryo scoring system, log years of experience and experience using TLI. None were found to have a statistically significant impact on embryologist performance at α = 0.05. The average implantation prediction accuracy for the embryologists was 51.9% for all embryos (N = 136). The average accuracy of the embryologists when assessing top quality and poor quality embryos (according to the Gardner score categorizations) was 57.5% and 57.4%, respectively, and 44.6% for fair quality embryos. Overall interobserver agreement was moderate (κ = 0.56, N = 136). The best agreement was achieved in the poor + top quality group (κ = 0.65, N = 77), while the agreement in the fair quality group was lower (κ = 0.25, N = 59). The DNN showed an overall accuracy rate of 62.5%, with accuracies of 62.2%, 61% and 65.6% for the poor, fair and top quality groups, respectively. The AUC for the DNN was higher than that of the embryologists overall (0.70 DNN vs 0.61 embryologists) as well as in all of the Gardner groups (DNN vs embryologists-Poor: 0.69 vs 0.62; Fair: 0.67 vs 0.53; Top: 0.77 vs 0.54).
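To make the two reported metrics concrete, the sketch below computes accuracy from binarized grades and AUC from the ordinal grades using the pairwise (Mann-Whitney) formulation. The helper names `accuracy` and `auc` and all data values are hypothetical; the abstract does not state how the authors computed AUC for the embryologists, so this is one plausible reading rather than the study's method.

```python
def accuracy(predicted, actual):
    """Fraction of binary predictions that match the known outcomes."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen implanted embryo
    receives a higher score than a randomly chosen non-implanted one,
    counting ties as one half.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical data: one rater's IPGs for six embryos and their outcomes
# (1 = implanted, 0 = did not implant).
toy_ipgs = [5, 4, 2, 3, 1, 2]
toy_outcomes = [1, 0, 1, 0, 0, 1]

# Binarize with the abstract's rule (IPG >= 3 predicts implantation).
toy_predictions = [1 if g >= 3 else 0 for g in toy_ipgs]
toy_accuracy = accuracy(toy_predictions, toy_outcomes)
toy_auc = auc(toy_ipgs, toy_outcomes)
```

Accuracy depends on the chosen cut-off (here IPG ≥ 3), whereas AUC summarizes ranking quality across all cut-offs, which is why the two metrics can rank raters differently.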
LIMITATIONS, REASONS FOR CAUTION: Blastocyst assessment was performed using video files acquired from time-lapse incubators, where each video contained data from a single focal plane. Clinical data regarding the underlying cause of infertility and endometrial thickness before the transfer were not available, yet may explain implantation failure and lower accuracy of IPGs. Implantation was defined as the presence of a gestational sac, whereas the detection of fetal heartbeat is a more robust marker of embryo viability. The raw data were anonymized to the extent that it was not possible to quantify the number of unique patients and cycles included in the study, potentially masking the effect of bias from a limited patient pool. Furthermore, the lack of demographic data makes it difficult to draw conclusions on how representative the dataset was of the wider population. Finally, embryologists were required to assess the implantation potential, not embryo quality. Although this is not the traditional approach to embryo evaluation, morphology/morphokinetics as a means of assessing embryo quality is believed to be strongly correlated with viability and, for some methods, implantation potential.
WIDER IMPLICATIONS OF THE FINDINGS: Embryo selection is a key element in IVF success and continues to be a challenge. Improving the predictive ability could assist in optimizing implantation success rates and other clinical outcomes and could minimize the financial and emotional burden on the patient. This study demonstrates moderate agreement rates between embryologists, likely due to the subjective nature of embryo assessment. In particular, we found that average embryologist accuracy and agreement were significantly lower for fair quality embryos when compared with that for top and poor quality embryos. Using data-driven algorithms as an assistive tool may help IVF professionals increase success rates and promote much needed standardization in the IVF clinic. Our results indicate a need for further research regarding technological advancement in this field.
STUDY FUNDING/COMPETING INTEREST(S): Embryonics Ltd is an Israel-based company. Funding for the study was partially provided by the Israeli Innovation Authority, grant #74556.
TRIAL REGISTRATION NUMBER: N/A.