Oh Sehyun, Abdelnabi Jasmine, Al-Dulaimi Ragheed, Aggarwal Ayush, Ramos Marcel, Davis Sean, Riester Markus, Waldron Levi
Epidemiology and Biostatistics, Graduate School of Public Health and Health Policy, City University of New York, New York, 10027, USA.
Institute for Implementation Science and Population Health, New York, 10027, USA.
F1000Res. 2020 Dec 21;9:1493. doi: 10.12688/f1000research.28033.2. eCollection 2020.
Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion to date format by spreadsheets. Official gene symbol resources such as the HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but they lack a programmatic interface and do not correct symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, as well as common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN.
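The spreadsheet problem described in the abstract can be illustrated with a minimal Python sketch. This is not HGNChelper's implementation or API (the package itself is written in R). The function names, the month-to-prefix table, and the toy set of approved symbols are all assumptions made for illustration. The sketch flags symbols that are missing from a reference set and, where a value looks like a spreadsheet date such as `1-Sep`, suggests the symbol it probably started as.

```python
import re

# Assumed, non-exhaustive mapping from spreadsheet month abbreviations to the
# gene-symbol prefixes they commonly replace (illustrative only).
MONTH_PREFIX = {"Mar": "MARCH", "Sep": "SEPT", "Oct": "OCT", "Dec": "DEC"}

# Matches date-like strings such as "1-Sep" or "02-Mar".
DATE_LIKE = re.compile(r"^(\d{1,2})-([A-Za-z]{3})$")


def undo_date_conversion(value):
    """Return the symbol a date-like string probably started as, else None."""
    match = DATE_LIKE.match(value.strip())
    if match is None:
        return None
    day, month = match.groups()
    prefix = MONTH_PREFIX.get(month.capitalize())
    if prefix is None:
        return None
    # int() drops any leading zero the spreadsheet added to the day.
    return f"{prefix}{int(day)}"


def check_symbols(symbols, approved):
    """Return (input, is_approved, suggestion) for each symbol.

    `approved` is a hypothetical stand-in for a reference symbol table;
    it is not real HGNC or MGI data.
    """
    rows = []
    for symbol in symbols:
        if symbol in approved:
            rows.append((symbol, True, symbol))
        else:
            rows.append((symbol, False, undo_date_conversion(symbol)))
    return rows


if __name__ == "__main__":
    toy_approved = {"TP53"}  # toy reference set, for illustration only
    for row in check_symbols(["TP53", "1-Sep", "2-Mar", "notagene"], toy_approved):
        print(row)
```

A recovered string such as `SEPT1` is only the symbol as originally typed. A real tool would still need to look it up in the current nomenclature database, since such symbols may themselves have been renamed or retired.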