
The reproducibility of COVID-19 data analysis: paradoxes, pitfalls, and future challenges.

Author information

Di Serio Clelia, Malgaroli Antonio, Ferrari Paolo, Kenett Ron S

Affiliations

Vita-Salute San Raffaele University, UniSR, Milan, Italy.

University Centre of Statistics in the Biomedical Sciences CUSSB, UniSR, Milan, Italy.

Publication information

PNAS Nexus. 2022 Aug 23;1(3):pgac125. doi: 10.1093/pnasnexus/pgac125. eCollection 2022 Jul.

Abstract

In the midst of the COVID-19 experience, we learned an important scientific lesson: knowledge acquisition and information quality in medicine depend more on "data quality" than on "data quantity." The large number of COVID-19 reports published in a very short time demonstrated that even the most advanced statistical and computational tools cannot properly overcome the poor quality of acquired data. The main evidence for this observation comes from the poor reproducibility of results. Indeed, understanding the data generation process is fundamental when investigating scientific questions such as prevalence, immunity, transmissibility, and susceptibility. Most COVID-19 studies are case reports based on "non-probability" sampling and do not adhere to the general principles of controlled experimental designs. Data collected in this way suffer from many limitations when used to derive clinical conclusions, including confounding factors, measurement errors, and selection bias effects. Each of these elements represents a source of uncertainty, which is often ignored or assumed to provide an unbiased random contribution. Inference retrieved from large data in medicine is also affected by data protection policies that, while protecting patients' privacy, are likely to considerably reduce the usefulness of big data in achieving fundamental goals such as effective and efficient data integration. This limits the degree of generalizability of scientific studies and leads to paradoxical and conflicting conclusions. We provide such examples from assessing the role of risk factors. In conclusion, new paradigms and new design schemes are needed in order to reach inferential conclusions that are meaningful and informative when dealing with data collected during emergencies like COVID-19.
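
To make the selection-bias point concrete, below is a minimal simulation sketch in Python. It is not taken from the paper; the population size, true prevalence, and testing probabilities are assumed purely for illustration. It shows how a prevalence estimate computed from a self-selected ("non-probability") sample of tested individuals can be far from the true value, whereas a simple random sample of the same population recovers it.

    import numpy as np

    # Illustrative simulation of selection bias in non-probability sampling.
    # All numbers below are assumptions for this sketch, not values from the paper.
    rng = np.random.default_rng(0)

    N = 1_000_000           # assumed population size
    true_prevalence = 0.05  # assumed true infection prevalence

    infected = rng.random(N) < true_prevalence

    # Assume infected (often symptomatic) people are much more likely to seek
    # testing, so the tested subgroup is a non-probability sample.
    p_test_if_infected = 0.60
    p_test_if_healthy = 0.05
    tested = rng.random(N) < np.where(infected, p_test_if_infected, p_test_if_healthy)

    # Naive estimate: prevalence among those who happened to be tested.
    naive_estimate = infected[tested].mean()

    # Design-based estimate: prevalence in a simple random sample of the population.
    srs_index = rng.choice(N, size=10_000, replace=False)
    srs_estimate = infected[srs_index].mean()

    print(f"true prevalence        : {true_prevalence:.3f}")
    print(f"estimate, tested only  : {naive_estimate:.3f}")  # strongly inflated
    print(f"estimate, random sample: {srs_estimate:.3f}")

With the assumed numbers, the tested-only estimate lands near 0.39 rather than 0.05, which is the kind of distortion the authors argue cannot be undone by downstream statistical machinery once the data generation process is ignored.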

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/369b/9896906/f8aa15a39783/pgac125fig1.jpg
