Egorov Sergei, Yuryev Anton, Daraselia Nikolai
Ariadne Genomics, Inc, Rockville, MD 20850, USA.
J Am Med Inform Assoc. 2004 May-Jun;11(3):174-8. doi: 10.1197/jamia.M1453. Epub 2004 Feb 5.
The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.
The developed system, called ProtScan, combines a carefully constructed dictionary of mammalian proteins with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by intensive expert curation.
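As an illustration of dictionary-based tagging in general, the following is a minimal sketch and not ProtScan's actual algorithm, which the abstract does not specify. The toy dictionary, the tokenization rule (lowercased alphanumeric runs, with hyphens and spaces treated as separators), and the longest-match strategy are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary: normalized token tuples -> canonical protein ID.
# A real dictionary would be built from sequence databases and curated.
PROTEIN_DICT = {
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
    ("tnf", "alpha"): "TNF",
    ("interleukin", "6"): "IL6",
    ("il", "6"): "IL6",
    ("p53",): "TP53",
}
MAX_LEN = max(len(key) for key in PROTEIN_DICT)


def tokenize(text):
    """Return (lowercased token, start, end) for each alphanumeric run.

    Hyphens, slashes, and whitespace act as separators, so "IL-6" and
    "IL 6" both become the tokens ("il", "6").
    """
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def tag_proteins(text):
    """Greedy longest-match lookup of token sequences in the dictionary.

    Returns a list of (start, end, protein_id) character spans.
    """
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Try the longest candidate first so that the full name wins
        # over any shorter name nested inside it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in PROTEIN_DICT:
                hits.append((tokens[i][1], tokens[i + n - 1][2],
                             PROTEIN_DICT[key]))
                matched_len = n
                break
        i += matched_len if matched_len else 1
    return hits


text = "IL-6 and tumor necrosis factor-alpha activate p53."
for start, end, protein_id in tag_proteins(text):
    print(text[start:end], "->", protein_id)
```

Normalizing tokens before lookup lets one dictionary entry cover several spelling variants of the same name, which is one plausible reason to pair a dictionary with a specialized tokenizer.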
The recall and precision of the system were determined using 1,000 randomly selected, hand-tagged Medline abstracts.
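For reference, precision and recall against a hand-tagged gold standard can be computed as in the sketch below. The exact-span matching criterion and the `(abstract_id, start, end)` span representation are assumptions for illustration; the abstract does not state how matches were scored.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of tagged spans.

    Each span is a hashable tuple such as (abstract_id, start, end).
    A prediction counts as correct only if it exactly matches a gold span
    (an assumed criterion, not necessarily the one used in the study).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans: 3 of 4 predictions are correct, and 3 of 5 gold spans are found.
gold_spans = {(1, 0, 4), (1, 9, 36), (2, 5, 8), (2, 20, 25), (3, 0, 3)}
predicted_spans = {(1, 0, 4), (1, 9, 36), (2, 5, 8), (3, 10, 14)}

p, r = precision_recall(predicted_spans, gold_spans)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision is the fraction of tagged spans that are correct, while recall is the fraction of hand-tagged spans that the system found.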
The developed system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.
The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.