Podowski Raf M, Cleary John G, Goncharoff Nicholas T, Amoutzias Gregory, Hayes William S
AstraZeneca R&D Boston and Karolinska Institutet.
Proc IEEE Comput Syst Bioinform Conf. 2004:415-24. doi: 10.1109/csb.2004.1332454.
Researchers, hindered by a lack of standard gene and protein naming conventions, endure long and sometimes fruitless literature searches. A system is described that automatically assigns gene names to their LocusLink IDs (LLIDs) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds one model per LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good-quality models (F-measure > 0.7, with nearly 60% of these exceeding 0.9) and 13,202 did not, mainly because too few document references were known for them. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that high-quality gene disambiguation can be achieved using scalable automated techniques.
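The quality criterion in the abstract (F-measure > 0.7 for a good model, > 0.9 for the best ones) can be illustrated with a minimal sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not say which variant was used. The function names and the band labels are hypothetical and are not taken from the described system.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true positives, false positives
    and false negatives. Assumption: the abstract's F-measure is F1."""
    if tp == 0:
        # No correct assignments: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def quality_band(f: float, good: float = 0.7, high: float = 0.9) -> str:
    """Bucket a per-gene model by the thresholds quoted in the abstract.
    The band labels are illustrative only."""
    if f > high:
        return "high"
    if f > good:
        return "good"
    return "insufficient"


if __name__ == "__main__":
    # Hypothetical counts for one gene model: 8 correct, 2 spurious, 2 missed.
    score = f_measure(tp=8, fp=2, fn=2)
    print(round(score, 3), quality_band(score))
```

Both thresholds are strict inequalities, matching the ">" used in the abstract, so a model scoring exactly 0.7 would not count as good under this sketch.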