Topole Eva, Biondaro Sonia, Montagna Isabella, Corre Sandrine, Corradi Massimo, Stanojevic Sanja, Graham Brian, Das Nilakash, Ray Kevin, Topalovic Marko
Global Clinical Development, Chiesi Farmaceutici, S.p.A., Parma, Italy.
Department of Medicine and Surgery, University of Parma, Parma, Italy.
ERJ Open Res. 2023 Jan 3;9(1). doi: 10.1183/23120541.00292-2022. eCollection 2023 Jan.
Acquiring high-quality spirometry data in clinical trials is important, particularly when using forced expiratory volume in 1 s or forced vital capacity as primary end-points. In addition to quantitative criteria, the American Thoracic Society (ATS)/European Respiratory Society (ERS) standards include subjective evaluation which introduces inter-rater variability and potential mistakes. We explored the value of artificial intelligence (AI)-based software (ArtiQ.QC) to assess spirometry quality and compared it to traditional over-reading control.
A random sample of 2000 sessions (8258 curves) was selected from Chiesi COPD and asthma trials (n=1000 per disease). Acceptability using the 2005 ATS/ERS standards was determined by over-reader review and by ArtiQ.QC. Additionally, three respiratory physicians jointly reviewed a subset of curves (n=150).
The majority of curves (n=7267, 88%) were of good quality. The AI agreed with over-readers in 91% of cases, with 97% sensitivity and 93% positive predictive value. Performance was significantly better in the asthma group. In the revised subset, n=50 curves were repeated to assess intra-rater reliability (κ=0.83, 0.86 and 0.80 for each of the three reviewers). All reviewers agreed on 63% of 100 unique tests (κ=0.5). When reviewers set the consensus (gold standard), individual agreement with it was 88%, 94% and 70%. The agreement between AI and "gold-standard" was 73%; over-reader agreement was 46%.
AI-based software can be used to measure spirometry data quality with comparable accuracy as experts. The assessment is a subjective exercise, with intra- and inter-rater variability even when the criteria are defined very precisely and objectively. By providing consistent results and immediate feedback to the sites, AI may benefit clinical trial conduct and variability reduction.
在临床试验中获取高质量的肺功能测定数据非常重要,尤其是在将第1秒用力呼气容积或用力肺活量作为主要终点时。除了定量标准外,美国胸科学会(ATS)/欧洲呼吸学会(ERS)标准还包括主观评估,这会引入评分者间的变异性和潜在错误。我们探讨了基于人工智能(AI)的软件(ArtiQ.QC)评估肺功能测定质量的价值,并将其与传统的复查控制进行比较。
从Chiesi慢性阻塞性肺疾病(COPD)和哮喘试验中随机抽取2000个检测时段(8258条曲线)(每种疾病各1000个)。使用2005年ATS/ERS标准的可接受性通过复查者审查和ArtiQ.QC来确定。此外,三名呼吸内科医生共同审查了一部分曲线(n = 150)。
大多数曲线(n = 7267,88%)质量良好。AI在91%的病例中与复查者意见一致,敏感性为97%,阳性预测值为93%。在哮喘组中表现明显更好。在修订后的子集中,重复了n = 50条曲线以评估评分者内信度(三位审查者各自的κ分别为0.83、0.86和0.80)。在100项独特测试中,所有审查者在63%的测试上达成一致(κ = 0.5)。当审查者达成共识(金标准)时,个人与该标准的一致性分别为88%、94%和70%。AI与“金标准”之间的一致性为73%;复查者之间的一致性为46%。
基于AI的软件可用于测量肺功能测定数据质量,其准确性与专家相当。评估是一项主观操作,即使标准定义得非常精确和客观,仍存在评分者内和评分者间的变异性。通过提供一致的结果并立即向各研究点反馈,AI可能有助于临床试验的开展并减少变异性。