Department of Epidemiology and Biostatistics, Texas A&M University School of Public Health, College Station, Texas, USA.
Department of Biostatistics and Bioinformatics, Moffitt Cancer Center & Research Institute, Tampa, Florida, USA.
Stat Med. 2022 Apr 15;41(8):1361-1375. doi: 10.1002/sim.9282. Epub 2021 Dec 12.
In pathological studies, subjective assays, especially companion diagnostic tests, can dramatically affect the treatment of cancer. Binary diagnostic test results (i.e., positive vs negative) may vary between the pathologists or observers who read the tumor slides. Some tests have clearly defined criteria that yield highly concordant results, even with minimal training. Other tests are more challenging, and observers may achieve poor concordance even with training. While there are many statistically rigorous methods for measuring concordance between observers, we are unaware of a method that can identify how many observers are needed to determine whether a test can reach acceptable concordance at all. Here we introduce a statistical approach to assessing test performance when the test is read by multiple observers, as occurs in the real world. By plotting the number of observers against the estimated overall agreement proportion, we obtain a curve that plateaus at the average observer concordance. Diagnostic tests that are well defined and easily judged show high concordance and plateau after few interobserver comparisons. More challenging tests do not plateau until many interobserver comparisons are made, and typically reach a lower plateau, or even 0. We further propose a statistical test of whether the overall agreement proportion will drop to 0 as the number of pathologists grows large. The proposed analytical framework can be used to evaluate the difficulty of interpreting pathological test criteria and platforms, and to determine how pathology-based subjective tests will perform in the real world. The method could also be used outside of pathology, wherever concordance of a diagnosis or decision point relies on the subjective application of multiple criteria.
We apply this method to two recent PD-L1 studies to test whether the curve of the overall agreement proportion converges to 0, and to determine the minimal sufficient number of observers required to estimate the concordance plateau of their reads.
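The curve described above can be illustrated with a small simulation. The sketch below is not the authors' estimator; it assumes, for illustration only, that the overall agreement proportion for k observers is the fraction of cases on which k randomly chosen observers all give the same binary read, averaged over random observer subsets. The data-generating model (true status flipped independently with a fixed error probability), function name, and all parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def overall_agreement(reads, k, n_draws=200, rng=rng):
    """Estimate the overall agreement proportion for k observers:
    the fraction of cases on which k randomly chosen observers all
    give the same binary read, averaged over n_draws random subsets.
    (Illustrative definition only, not the paper's estimator.)"""
    n_cases, n_obs = reads.shape
    props = []
    for _ in range(n_draws):
        cols = rng.choice(n_obs, size=k, replace=False)  # pick k observers
        sub = reads[:, cols]
        all_agree = (sub == sub[:, [0]]).all(axis=1)  # rows where all k reads match
        props.append(all_agree.mean())
    return float(np.mean(props))

# Hypothetical data: 50 cases read by 10 observers, each read being the
# true binary status flipped independently with probability 0.1.
truth = rng.integers(0, 2, size=(50, 1))
flips = rng.random((50, 10)) < 0.1
reads = np.where(flips, 1 - truth, truth)

# Agreement curve: overall agreement proportion vs number of observers.
curve = {k: overall_agreement(reads, k) for k in range(2, 11)}
```

Plotting `curve` against k reproduces the qualitative behavior in the abstract: the curve declines as observers are added and flattens toward a plateau, which is high for easily judged tests and low (or 0) for challenging ones.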