Major data analysis errors invalidate cancer microbiome findings.

Author information

Gihawi Abraham, Ge Yuchen, Lu Jennifer, Puiu Daniela, Xu Amanda, Cooper Colin S, Brewer Daniel S, Pertea Mihaela, Salzberg Steven L

Affiliations

Norwich Medical School, University of East Anglia, Norwich, UK.

Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA.

Publication information

bioRxiv. 2023 Jul 31:2023.07.28.550993. doi: 10.1101/2023.07.28.550993.

DOI:10.1101/2023.07.28.550993

PMID:37577699

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10418105/

Abstract

We re-analyzed the data from a recent large-scale study that reported strong correlations between microbial organisms and 33 different cancer types, and that created machine learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (1) errors in the genome database and the associated computational methods led to millions of false positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (2) errors in transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.
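The second flaw can be hard to picture: how can a microbe with no reads at all carry a signal? The toy sketch below shows one way this can happen. It assumes a log2 counts-per-million transform with a small offset, used here only as an illustrative stand-in and not necessarily the transformation applied in the re-analyzed study. The two tumor types, their sequencing depths, and all names in the code are hypothetical.

```python
import numpy as np


def log_cpm(counts, lib_sizes, prior=0.5):
    """Per-sample log2 counts-per-million with a small offset.

    Illustrative stand-in for a depth-dependent transformation; the
    offset keeps zero counts finite, so a zero count maps to a value
    that depends only on the sample's library size.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = np.asarray(lib_sizes, dtype=float)
    return np.log2((counts + prior) / (lib_sizes[:, None] + 1.0) * 1e6)


rng = np.random.default_rng(0)
n_per_type = 50

# Hypothetical assumption: two tumor types sequenced at different depths.
lib_a = rng.integers(900_000, 1_100_000, size=n_per_type)
lib_b = rng.integers(4_500_000, 5_500_000, size=n_per_type)
lib_sizes = np.concatenate([lib_a, lib_b])
labels = np.array([0] * n_per_type + [1] * n_per_type)

# One "microbe" with zero reads in every sample of both tumor types.
raw = np.zeros((2 * n_per_type, 1))
transformed = log_cpm(raw, lib_sizes)[:, 0]

# A one-feature threshold classifier: midpoint between the group means.
threshold = (transformed[labels == 0].mean() + transformed[labels == 1].mean()) / 2
predicted = (transformed < threshold).astype(int)
accuracy = (predicted == labels).mean()

print(f"raw counts all zero: {bool(np.all(raw == 0))}")
print(f"accuracy from a zero-read feature: {accuracy:.2f}")
```

In this toy setup the raw data contain no information about the microbe, yet the transformed values separate the two groups because they encode sequencing depth. A classifier trained on such values can look accurate without any biological signal behind it.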
