McDonald Ryan, Pereira Fernando
Department of Computer and Information Science, University of Pennsylvania, Levine Hall, 3330 Walnut Street, Philadelphia, Pennsylvania 19104, USA.
BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S6. doi: 10.1186/1471-2105-6-S1-S6. Epub 2005 May 24.
We present a model for tagging gene and protein mentions in text using the probabilistic sequence tagging framework of conditional random fields (CRFs). Conditional random fields directly model the probability P(t|o) of a tag sequence t given an observation sequence o, and have previously been employed successfully for other tagging tasks. The mechanics of CRFs and their relationship to maximum entropy are discussed in detail.
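For readers unfamiliar with the framework, the conditional probability above is usually written in the standard linear-chain form shown below. The abstract does not state the formula, so the notation is an assumption here: feature functions f_k, weights \lambda_k, sequence length n, and normalizer Z(o).

```latex
P(t \mid o) = \frac{1}{Z(o)} \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(t_{i-1}, t_i, o, i) \right),
\qquad
Z(o) = \sum_{t'} \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(t'_{i-1}, t'_i, o, i) \right)
```

Each f_k may inspect the whole observation sequence o at position i, which is what allows overlapping orthographic and lexicon features to be combined in a single model. With only per-position features and no dependence on t_{i-1}, the expression reduces to a maximum entropy classifier applied independently at each token.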
We employ a diverse feature set containing standard orthographic features combined with expert features in the form of gene and biological term lexicons to achieve a precision of 86.4% and recall of 78.7%. An analysis of the contribution of the various features of the model is provided.
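As a rough illustration of the kind of feature set described, the sketch below builds orthographic and lexicon features for a single token and combines a precision and recall into a balanced F-score. It is a minimal sketch rather than the authors' implementation: the names `GENE_LEXICON`, `token_features`, and `f_score`, the particular features, and the lexicon entries are all hypothetical.

```python
# Hypothetical toy lexicon; the paper's gene and biological term
# lexicons are not reproduced here.
GENE_LEXICON = {"brca1", "p53", "interleukin-2"}


def token_features(tokens, i, lexicon=GENE_LEXICON):
    """Return illustrative orthographic and lexicon features for tokens[i]."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        # Expert feature: membership in a gene or term lexicon.
        "in_lexicon": tok.lower() in lexicon,
        # Neighbouring words give the tagger local context.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ["The", "BRCA1", "gene", "is", "mutated"]
    print(token_features(sentence, 1))
    # Precision and recall as reported in the abstract.
    print(round(f_score(0.864, 0.787), 3))
```

In a CRF tagger, dictionaries like these would be produced for every token in a sentence and paired with per-token tags for training.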