Fundel Katrin, Zimmer Ralf
Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany.
BMC Bioinformatics. 2006 Aug 9;7:372. doi: 10.1186/1471-2105-7-372.
Biological objects such as genes and proteins frequently have several alternative names in use. Applications such as manual literature search, automated text mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require knowledge of all names used to refer to a given gene or protein. Various organism-specific and general public databases aim to organize knowledge about genes and proteins, and these databases can be used to derive gene and protein name dictionaries. So far, little is known about how the databases differ in size, ambiguity, and overlap.
We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, as well as against a lexicon of common English words and domain-related non-gene terms, and we compared the data sources in terms of the size of the extracted dictionaries and the overlap of synonyms between them. The study shows that the number of genes/proteins and synonyms covered by individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between organisms. Furthermore, it shows that, despite considerable co-curation efforts, the overlap of synonyms between data sources is only moderate, and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms depends on the organism considered.
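The three measurements described above (ambiguity within a dictionary, overlap of synonyms between two sources, and clashes with a lexicon of common words) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the case-insensitive matching, the use of a Jaccard ratio for overlap, and the toy identifiers and names are all assumptions made for illustration.

```python
from collections import defaultdict


def all_synonyms(dictionary):
    """Collect every synonym in a dictionary, lower-cased (assumed matching rule)."""
    return {name.lower() for names in dictionary.values() for name in names}


def ambiguous_synonyms(dictionary):
    """Return the synonyms that refer to more than one gene identifier."""
    owners = defaultdict(set)
    for gene_id, names in dictionary.items():
        for name in names:
            owners[name.lower()].add(gene_id)
    return {name: ids for name, ids in owners.items() if len(ids) > 1}


def synonym_overlap(dict_a, dict_b):
    """Shared synonyms divided by all synonyms (Jaccard ratio, an assumed measure)."""
    a, b = all_synonyms(dict_a), all_synonyms(dict_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexicon_clashes(dictionary, lexicon):
    """Return the synonyms that are also entries of a common-word lexicon."""
    return all_synonyms(dictionary) & {word.lower() for word in lexicon}


# Invented toy data: identifiers and names are placeholders, not real genes.
source_a = {"G1": {"abc1", "cat"}, "G2": {"xyz2", "cat"}, "G3": {"foo3"}}
source_b = {"G1": {"ABC1"}, "G3": {"foo3", "bar4"}}
english = {"cat", "dog"}

ambiguous = ambiguous_synonyms(source_a)      # "cat" maps to both G1 and G2
overlap = synonym_overlap(source_a, source_b)  # 2 shared of 5 distinct names
clashes = lexicon_clashes(source_a, english)   # "cat" is also a common word
```

With a real dictionary, each source would first be reduced to a mapping from a gene identifier to its set of names; stricter or looser name normalization would change all three numbers.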
In conclusion, these results indicate that combining the data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more names in actual use than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed or Google search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an example client application.
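Why a combined dictionary can hold more names than any single source can be seen in a small sketch. It assumes, hypothetically, that the sources already share a common gene identifier; the function names and toy entries are invented, and the curation step mentioned above is not modeled.

```python
def merge_dictionaries(*sources):
    """Union the synonym sets of several sources, keyed by a shared gene identifier."""
    merged = {}
    for source in sources:
        for gene_id, names in source.items():
            merged.setdefault(gene_id, set()).update(names)
    return merged


def count_synonyms(dictionary):
    """Total number of (gene, synonym) entries in a dictionary."""
    return sum(len(names) for names in dictionary.values())


# Invented toy data: identifiers and names are placeholders, not real genes.
db_a = {"G1": {"abc1"}, "G2": {"xyz2"}}
db_b = {"G1": {"abc1", "abc-1"}, "G3": {"foo3"}}

combined = merge_dictionaries(db_a, db_b)
sizes = (count_synonyms(db_a), count_synonyms(db_b), count_synonyms(combined))
```

In this toy case the combined dictionary covers three genes and four names, while the individual sources cover two genes each with two and three names respectively.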