A simple and practical dictionary-based approach for identification of proteins in Medline abstracts.

Authors

Egorov Sergei, Yuryev Anton, Daraselia Nikolai

Affiliation

Ariadne Genomics, Inc, Rockville, MD 20850, USA.

Publication information

J Am Med Inform Assoc. 2004 May-Jun;11(3):174-8. doi: 10.1197/jamia.M1453. Epub 2004 Feb 5.

DOI:10.1197/jamia.M1453

PMID:14764613

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC400515/

Abstract

OBJECTIVE

The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.

DESIGN

The developed system, called ProtScan, uses a carefully constructed dictionary of mammalian proteins together with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by an intensive expert curation step.
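To make the dictionary-plus-tokenization idea concrete, the following is a minimal sketch of greedy longest-match dictionary tagging. It is not ProtScan's actual algorithm, which the abstract does not describe in detail. The toy dictionary, the tokenization rule (lowercased alphanumeric runs) and the names `TOY_DICTIONARY`, `tokenize` and `tag_proteins` are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary (NOT the curated ProtScan dictionary):
# normalized token sequence -> canonical protein identifier.
TOY_DICTIONARY = {
    ("interleukin", "2"): "IL2",
    ("il", "2"): "IL2",
    ("p53",): "TP53",
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
}
MAX_LEN = max(len(key) for key in TOY_DICTIONARY)

# Assumed tokenization rule: runs of letters and digits, so that
# "IL-2", "IL 2" and "il-2" all normalize to the tokens ("il", "2").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def tag_proteins(text):
    """Greedy longest-match lookup of token sequences in the dictionary."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in TOY_DICTIONARY:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((text[start:end], TOY_DICTIONARY[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits


if __name__ == "__main__":
    sentence = "IL-2 and tumor necrosis factor alpha regulate p53."
    for surface, canonical in tag_proteins(sentence):
        print(surface, "->", canonical)
```

Normalizing tokens before lookup lets a single dictionary entry cover several spelling variants. Preferring the longest match keeps a multi-word name from being split into shorter entries.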

MEASUREMENTS

The recall and precision of the system have been determined using 1000 randomly selected and hand-tagged Medline abstracts.
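As a reminder of how these two measures are defined, the sketch below compares system output against hand-tagged annotations. The function name `precision_recall` and the tiny example sets are hypothetical and are not data from the study. Each annotation is assumed to be an (abstract id, protein mention) pair.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted tags against gold tags.

    precision = true positives / all predicted tags
    recall    = true positives / all gold (hand-tagged) tags
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up annotations for illustration only.
    gold = {(1, "IL-2"), (1, "p53"), (2, "TNF"), (2, "insulin")}
    predicted = {(1, "IL-2"), (1, "p53"), (2, "TNF")}
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case every predicted tag is correct but one hand-tagged mention is missed, so precision is higher than recall.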

RESULTS

The developed system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without the NOS annotation, precision and recall were 98.5% and 84%, respectively.

CONCLUSION

The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.
