Smidt Nynke, Rutjes Anne W S, van der Windt Daniëlle A W M, Ostelo Raymond W J G, Bossuyt Patrick M, Reitsma Johannes B, Bouter Lex M, de Vet Henrica C W
Institute for Research in Extramural Medicine, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands.
BMC Med Res Methodol. 2006 Mar 15;6:12. doi: 10.1186/1471-2288-6-12.
In January 2003, the STAndards for the Reporting of Diagnostic accuracy studies (STARD) statement was published in a number of journals to improve the quality of reporting of diagnostic accuracy studies. We designed a study to investigate the inter-assessment reproducibility, as well as the intra- and inter-observer reproducibility, of the items in the STARD statement.
Thirty-two diagnostic accuracy studies published in 2000 in medical journals with an impact factor of at least 4 were included. Two reviewers independently evaluated the quality of reporting of these studies using the 25 items of the STARD statement. A consensus evaluation was obtained by discussing and resolving disagreements between the reviewers. Almost two years later, the same studies were evaluated by the same reviewers. For each item, the percentage of agreement and Cohen's kappa between the first and second consensus assessments (inter-assessment) were calculated. Intraclass correlation coefficients (ICCs) were calculated to evaluate the overall reliability of the assessments.
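The abstract does not reproduce the authors' analysis code. As a rough illustration of the two per-item agreement measures named above, the Python sketch below computes the percentage of agreement and Cohen's kappa for a pair of hypothetical item scores; the functions, variable names and data are invented for illustration and are not part of the published study.

```python
# Minimal sketch (not the authors' code) of per-item agreement between two
# consensus assessments. Item scores are assumed to be coded "yes"/"no" for
# whether a STARD item was reported; the data below are invented.

def percent_agreement(a, b):
    """Proportion of studies on which the two assessments give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    categories = set(a) | set(b)
    n = len(a)
    p_observed = percent_agreement(a, b)
    # Agreement expected if the two assessments were statistically independent.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical scores for one STARD item across eight studies.
first_assessment  = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
second_assessment = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]

print(f"agreement = {percent_agreement(first_assessment, second_assessment):.2f}")
print(f"kappa     = {cohens_kappa(first_assessment, second_assessment):.2f}")
```

Because kappa corrects the observed agreement for the agreement expected by chance, an item can show a high percentage of agreement yet a modest kappa, which is the pattern reported for several items below.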
The overall inter-assessment agreement for all items of the STARD statement was 85% (Cohen's kappa 0.70) and varied from 63% to 100% for individual items. The largest differences between the two assessments were found for the reporting of the rationale for the reference standard (kappa 0.37), the number of included participants that underwent tests (kappa 0.28), the distribution of the severity of disease (kappa 0.23), a cross-tabulation of the results of the index test by the results of the reference standard (kappa 0.33), and how indeterminate results, missing data and outliers were handled (kappa 0.25). Large differences for these items were also observed within and between reviewers. The inter-assessment reliability of the STARD checklist was satisfactory (ICC = 0.79 [95% CI: 0.62 to 0.89]).
Although the overall reproducibility of assessments of the quality of reporting of diagnostic accuracy studies using the STARD statement was found to be good, substantial disagreements were found for specific items. These disagreements were caused not so much by differences in the reviewers' interpretation of the items as by difficulties in assessing the reporting of these items due to a lack of clarity within the articles. Including a flow diagram in all reports of diagnostic accuracy studies would be very helpful in reducing confusion among readers and reviewers.