Ada Health GmbH, Berlin, Germany.
BMJ Open. 2020 Dec 16;10(12):e040269. doi: 10.1136/bmjopen-2020-040269.
OBJECTIVES: To compare breadth of condition coverage, accuracy of suggested conditions and appropriateness of urgency advice of eight popular symptom assessment apps.
DESIGN: Vignettes study.
SETTING: 200 primary care vignettes.
INTERVENTION/COMPARATOR: For eight apps and seven general practitioners (GPs): breadth of coverage, condition-suggestion accuracy and urgency-advice accuracy, measured against the vignettes' gold standard.
PRIMARY OUTCOME MEASURES: (1) Proportion of conditions 'covered' by an app, that is, not excluded because the user was too young/old or pregnant, or not modelled; (2) proportion of vignettes with the correct primary diagnosis among the top 3 conditions suggested; (3) proportion of 'safe' urgency advice (ie, at gold standard level, more conservative, or no more than one level less conservative).
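A minimal sketch of how these three outcome measures could be computed per app is given below. This is not the authors' code; the data structure and the assumption that urgency is coded as an ordinal integer (lower value = more urgent/conservative) are illustrative only.

# Hypothetical sketch: computing coverage, top-3 accuracy and safe urgency advice
# from per-vignette app results. Assumes urgency levels are ordinal integers,
# where a LOWER number means MORE urgent (more conservative) advice.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VignetteResult:
    covered: bool              # app did not exclude the user or condition
    suggestions: List[str]     # app's ranked condition suggestions
    gold_condition: str        # vignette's gold-standard primary diagnosis
    urgency: Optional[int]     # app's advised urgency level (None if no advice given)
    gold_urgency: int          # gold-standard urgency level

def coverage(results: List[VignetteResult]) -> float:
    # Measure 1: proportion of vignettes the app covered at all.
    return sum(r.covered for r in results) / len(results)

def top3_accuracy(results: List[VignetteResult]) -> float:
    # Measure 2: gold-standard diagnosis appears among the top 3 suggestions.
    return sum(r.gold_condition in r.suggestions[:3] for r in results) / len(results)

def safe_urgency(results: List[VignetteResult]) -> float:
    # Measure 3: advice is 'safe' if at gold level, more conservative, or at most
    # one level less conservative; vignettes with no advice are excluded.
    advised = [r for r in results if r.urgency is not None]
    return sum(r.urgency <= r.gold_urgency + 1 for r in advised) / len(advised)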
RESULTS: Condition-suggestion coverage was highly variable, with some apps not offering a suggestion for many users: in alphabetical order, Ada: 99.0%; Babylon: 51.5%; Buoy: 88.5%; K Health: 74.5%; Mediktor: 80.5%; Symptomate: 61.5%; Your.MD: 64.5%; WebMD: 93.0%. Top-3 suggestion accuracy was: GPs (average): 82.1%±5.2%; Ada: 70.5%; Babylon: 32.0%; Buoy: 43.0%; K Health: 36.0%; Mediktor: 36.0%; Symptomate: 27.5%; WebMD: 35.5%; Your.MD: 23.5%. Some apps excluded certain user demographics or conditions, and their performance was generally better when the corresponding vignettes were excluded. For safe urgency advice, the tested GPs averaged 97.0%±2.5%. For the vignettes with advice provided, only three apps had safety performance within 1 SD of the GPs (Ada: 97.0%; Babylon: 95.1%; Symptomate: 97.8%). One app had safety performance within 2 SDs of the GPs (Your.MD: 92.6%). Three apps had safety performance outside 2 SDs of the GPs (Buoy: 80.0%, p<0.001; K Health: 81.3%, p<0.001; Mediktor: 87.3%, p=1.3×10).
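The SD-band comparison reported above can be illustrated with a short sketch. The function name and the hard-coded GP benchmark values (mean 97.0%, SD 2.5% from this study) are illustrative assumptions, not the authors' analysis code.

# Hypothetical sketch: classify an app's safety score relative to the GP benchmark
# by how many standard deviations it falls below the GP mean.
def sd_band(app_score: float, gp_mean: float = 97.0, gp_sd: float = 2.5) -> str:
    deviation = gp_mean - app_score   # negative or small values are within band
    if deviation <= gp_sd:
        return "within 1 SD of GPs"
    if deviation <= 2 * gp_sd:
        return "within 2 SDs of GPs"
    return "outside 2 SDs of GPs"

# Example: sd_band(92.6) -> "within 2 SDs of GPs"; sd_band(80.0) -> "outside 2 SDs of GPs"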
CONCLUSIONS: The utility of digital symptom assessment apps relies on coverage, accuracy and safety. While no digital tool outperformed GPs, some came close, and the iterative nature of software improvement offers scalable gains in care.