Qin Yanan, Yi Daiyao, Chen Xianghao, Guan Yuanfang
Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
NAR Genom Bioinform. 2021 Oct 4;3(4):lqab089. doi: 10.1093/nargab/lqab089. eCollection 2021 Dec.
More than 110 000 publications have used microarrays to decipher phenotype-associated genes, clinical biomarkers and gene functions. Microarrays rely on digitally assaying the fluorescence signals of arrays. In this study, we retrospectively constructed raw images for 37 724 published microarray datasets and developed deep learning algorithms to automatically detect systematic defects. We report that an alarming 26.73% of the microarray-based studies are affected by serious imaging defects. By literature mining, we found that publications associated with these affected microarrays reported disproportionately more biological discoveries on genes in the contaminated areas than on other genes. Of the gene-level conclusions reported in these publications, 28.82% were based on measurements falling into the contaminated areas, indicating severe, systematic problems caused by such contamination. We provide the identified problematic published datasets, the affected genes and the imputed arrays, as well as software tools for scanning for such contamination, which will be essential for future studies that scrutinize and critically analyze microarray data.
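As a toy illustration of the bookkeeping behind a figure such as the 28.82% above, the following minimal sketch counts how many reported genes have at least one probe inside a flagged region of an array. It is not the authors' deep learning pipeline: the function name `contaminated_fraction`, the dictionary of probe coordinates, the set-based defect mask and the gene labels are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int]  # (row, column) of a probe on the array; assumed layout


def contaminated_fraction(
    reported_genes: List[str],
    probe_positions: Dict[str, List[Coord]],
    defect_mask: Set[Coord],
) -> float:
    """Return the fraction of reported genes with at least one probe in the mask."""
    if not reported_genes:
        return 0.0
    hits = 0
    for gene in reported_genes:
        # Genes with no known probe position are treated as unaffected.
        coords = probe_positions.get(gene, [])
        if any(coord in defect_mask for coord in coords):
            hits += 1
    return hits / len(reported_genes)


# Hypothetical defect region: rows 0-1, columns 0-2 of the array.
defect_mask = {(r, c) for r in range(2) for c in range(3)}

# Hypothetical probe layout for four made-up genes.
probe_positions = {
    "GENE_A": [(0, 1)],
    "GENE_B": [(5, 5)],
    "GENE_C": [(1, 2), (7, 7)],
    "GENE_D": [(3, 3)],
}

reported = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
fraction = contaminated_fraction(reported, probe_positions, defect_mask)
print(f"{fraction:.2%}")
```

In this sketch a gene counts as affected if any one of its probes lies in the flagged region; a stricter rule (for example, requiring a majority of probes) would be an equally plausible assumption and would lower the reported fraction.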