Chang Jeffrey T, Schütze Hinrich, Altman Russ B
Department of Genetics, Stanford Medical Center, 300 Pasteur Drive, Lane L 301, Mail Code 5120, Stanford, CA 94305-5120, USA.
Bioinformatics. 2004 Jan 22;20(2):216-25. doi: 10.1093/bioinformatics/btg393.
New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context.
We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs.
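The trade-off behind a score cutoff can be illustrated with a minimal Python sketch. This is not GAPSCORE's code: the function name `precision_recall_f`, the term scores, and the gold-standard set are all hypothetical, invented for illustration. It assumes only that each candidate term receives a numeric score and that terms scoring at or above the cutoff are reported as gene or protein names.

```python
def precision_recall_f(scored, gold, cutoff):
    """Return (precision, recall, F-score) for terms scoring at or above cutoff."""
    predicted = {term for term, score in scored.items() if score >= cutoff}
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical scores and gold-standard names, invented for illustration only.
scored = {"p53": 0.95, "BRCA1": 0.80, "kinase": 0.40, "cell": 0.10}
gold = {"p53", "BRCA1", "kinase"}

for cutoff in (0.9, 0.5, 0.3, 0.0):
    p, r, f = precision_recall_f(scored, gold, cutoff)
    print(f"cutoff={cutoff:.1f} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy data, lowering the cutoff raises recall until a non-gene term is admitted, at which point precision drops. A user who needs high precision would pick a higher cutoff, while one who needs broad coverage would pick a lower one.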
GAPSCORE is available at http://bionlp.stanford.edu/gapscore/