Weeber Marc, Schijvenaars Bob J, Van Mulligen Erik M, Mons Barend, Jelier Rob, Van Der Eijk Christian C, Kors Jan A
Department of Medical Informatics, Erasmus MC, 3000 DR Rotterdam, The Netherlands.
AMIA Annu Symp Proc. 2003;2003:704-8.
Genes are discovered almost daily, and new names have to be found for them. Although there are guidelines for gene nomenclature, the naming process is highly creative. Human genes are often named with a gene symbol and a longer, more descriptive term; the short form is very often an abbreviation of the long form. Abbreviations in biomedical language are highly ambiguous, i.e., one gene symbol often refers to more than one gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE for the use of human gene symbols derived from LocusLink. It turns out that just over 40% of these symbols occur in MEDLINE; however, many of these occurrences are not related to genes. In the process of making this inventory, a disambiguation test collection is constructed automatically.