Deakin University, School of Life and Environmental Sciences, Geelong, Australia.
PLoS Comput Biol. 2021 Jul 30;17(7):e1008984. doi: 10.1371/journal.pcbi.1008984. eCollection 2021 Jul.
Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a 2016 report highlighted the extent of the problem. To assess this, we scanned supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. The higher figure arises because gene names are converted not just to dates and floating-point numbers, but also to the internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.
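The three error classes named in the abstract (dates, floating-point numbers, and five-digit internal date numbers) can be illustrated with a small heuristic check. The following is a minimal sketch, not the authors' scanning software: the function names `classify_cell` and `serial_to_date`, the regular expressions, and the sample values are all assumptions chosen for illustration. It also assumes the spreadsheet's 1900 date system, in which serial numbers count days from 1899-12-30 for dates after February 1900.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical patterns for illustration only; a real scanner would need
# broader coverage (locale-specific month names, other date layouts, etc.).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_RE = re.compile(
    rf"^(?:\d{{1,2}}[-/](?:{MONTHS})|(?:{MONTHS})[-/]\d{{1,2}})$",
    re.IGNORECASE,
)
SCI_RE = re.compile(r"^\d(?:\.\d+)?E\+\d+$", re.IGNORECASE)
SERIAL_RE = re.compile(r"^\d{5}$")


def classify_cell(value: str) -> Optional[str]:
    """Return the suspected error class for a gene-column cell, or None."""
    text = value.strip()
    if DATE_RE.match(text):
        return "date"      # e.g. a gene symbol displayed as "1-Sep"
    if SCI_RE.match(text):
        return "float"     # e.g. an identifier displayed as "2.31E+13"
    if SERIAL_RE.match(text):
        return "serial"    # five-digit number, possibly an internal date
    return None


def serial_to_date(serial: int) -> date:
    """Convert a 1900-system serial number to a calendar date.

    Assumes serial >= 61, where the 1899-12-30 epoch is valid.
    """
    return date(1899, 12, 30) + timedelta(days=serial)


if __name__ == "__main__":
    column = ["TP53", "1-Sep", "2.31E+13", "44075", "BRCA1"]
    for cell in column:
        print(cell, classify_cell(cell))
    print(serial_to_date(44075))
```

A five-digit number in a gene column is only a candidate error, since numeric identifiers can legitimately have five digits. Converting the number back to a date with `serial_to_date` and checking whether the day and month correspond to a plausible gene symbol is one way to reduce false positives under these assumptions.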