Zaninovic Nikica, Sierra Jose T, Malmsten Jonas E, Rosenwaks Zev
Weill Cornell Medicine, Ronald O. Perelman and Claudia Cohen Center for Reproductive Medicine, New York, New York.
QED Analytics, Princeton, New Jersey.
F S Sci. 2024 Feb;5(1):50-57. doi: 10.1016/j.xfss.2023.10.002. Epub 2023 Oct 14.
OBJECTIVE: To evaluate the degree of agreement of embryo ranking between embryologists and eight artificial intelligence (AI) algorithms.
DESIGN: Retrospective study.
PATIENT(S): A total of 100 cycles with at least eight embryos were selected from the Weill Cornell Medicine database. For each embryo, the full-length time-lapse (TL) videos, as well as a single embryo image at 120 hours, were given to five embryologists and eight AI algorithms for ranking.
INTERVENTION(S): None.
MAIN OUTCOME MEASURE(S): Kendall rank correlation coefficient (Kendall's τ).
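As a hedged illustration of the outcome measure, Kendall's τ between two strict (tie-free) rankings can be computed directly from concordant and discordant pairs; the eight-embryo rankings below are invented for demonstration and are not study data (`scipy.stats.kendalltau` would give the same value):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings:
    (concordant - discordant) / total pairs."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both rankers order embryos i and j the same way.
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings of eight embryos (1 = best) by two observers;
# illustrative values only, not data from the study.
ranker_a = [1, 2, 3, 4, 5, 6, 7, 8]
ranker_b = [1, 2, 4, 3, 5, 6, 8, 7]  # two adjacent swaps

print(round(kendall_tau(ranker_a, ranker_b), 3))  # 0.857 (26 concordant, 2 discordant of 28 pairs)
```

A τ of 1.0 indicates identical rankings and −1.0 a fully reversed ranking, which is why the study's averages (0.70–0.78 among embryologists vs. 0.47–0.53 involving AI) are directly comparable across method pairs.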
RESULT(S): Embryologists had a high degree of agreement in the overall ranking of the 100 cycles, with an average Kendall's τ (K-τ) of 0.70, slightly lower than the interembryologist agreement when using a single image or video (average K-τ = 0.78). Overall agreement between embryologists and the AI algorithms was significantly lower (average K-τ = 0.53) and similar to the observed low inter-AI algorithm agreement (average K-τ = 0.47). Notably, two of the eight algorithms had very low agreement with the other ranking methodologies (average K-τ = 0.05) and with each other (K-τ = 0.01). The average agreement in selecting the best-quality embryo (1 of 8 in 100 cycles, with an expected agreement by random chance of 12.5%; 95% confidence interval [CI95], 6%-19%) was 59.5% among embryologists and 40.3% among six AI algorithms; for the two algorithms with low overall agreement, it was 11.7%. Agreement in selecting the same top two embryos per cycle (expected agreement by random chance, 25.0%; CI95, 17%-32%) was 73.5% among embryologists and 56.0% among the AI methods excluding the two discordant algorithms, whose average agreement of 24.4% fell within the range expected by random chance. Intraembryologist ranking agreement (single image vs. video) was 71.7% for the single best embryo and 77.8% for the top two embryos. Analysis of average raw scores indicated that cycles with low diversity of embryo quality generally yielded lower overall agreement between methods (embryologists and AI models).
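The chance baselines quoted in the results (12.5% for the single best embryo and 25.0% for the top two of eight) can be checked with a short Monte Carlo sketch. The overlap-fraction metric below is an assumption about how chance agreement would be scored, not a method taken from the paper:

```python
import random

def chance_agreement(n_embryos=8, top_k=1, trials=100_000, seed=0):
    """Estimate the expected agreement between two independent random
    rankers: the fraction of one ranker's top-k embryos that also appear
    in the other ranker's top-k (assumed metric, for illustration)."""
    rng = random.Random(seed)
    ids = list(range(n_embryos))
    total = 0.0
    for _ in range(trials):
        top_a = set(rng.sample(ids, top_k))  # ranker 1's random top-k
        top_b = set(rng.sample(ids, top_k))  # ranker 2's random top-k
        total += len(top_a & top_b) / top_k
    return total / trials

print(round(chance_agreement(top_k=1), 3))  # ~0.125 (1/8)
print(round(chance_agreement(top_k=2), 3))  # ~0.25 (expected overlap 2 * 2/8, halved)
```

Against these baselines, the two discordant algorithms' 11.7% (best embryo) and 24.4% (top two) land at chance level, which is the basis for the study's conclusion about them.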
CONCLUSION(S): To our knowledge, this is the first study to evaluate the level of agreement in embryo-quality ranking between different AI algorithms and embryologists. The different concordance measures were consistent and indicated that intraembryologist agreement was highest, followed by interembryologist agreement. In contrast, agreement between some of the AI algorithms and embryologists was similar to the inter-AI algorithm agreement, which itself showed a wide range of pairwise concordance. Notably, two AI models showed intra- and inter-agreement at the level expected from random selection.