School of Computer Science, University of Manchester, UK.
J Biomed Inform. 2009 Oct;42(5):887-94. doi: 10.1016/j.jbi.2009.04.001. Epub 2009 Apr 11.
Transcription factors (TFs) play a crucial role in gene regulation, and providing structured and curated information about them is important for genome biology. Manual curation of TF-related data is time-consuming and always lags behind the actual knowledge available in the biomedical literature. Here we present a machine-learning text mining approach for identification and tagging of protein mentions that play a TF role in a given context to support the curation process. More precisely, the method explicitly identifies those protein mentions in text that refer to their potential TF functions. The prediction features are engineered from the results of shallow parsing and domain-specific processing (recognition of relevant entities appearing in phrases), and a phrase-based Conditional Random Fields (CRF) model is used to capture the content and context information of candidate entities. The proposed approach for the identification of TF mentions has been tested on a set of evidence sentences from the TRANSFAC and FlyTF databases. It achieved an F-measure of around 51.5% with a precision of 62.5% using 5-fold cross-validation evaluation. The experimental results suggest that the phrase-based CRF model benefits from the flexibility to use correlated domain-specific features that describe the dependencies between TFs and other entities. To the best of our knowledge, this work is one of the first attempts to apply text-mining techniques to the task of assigning semantic roles to protein mentions.
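The abstract reports an F-measure and a precision but no recall. As a rough illustration only, the sketch below back-solves the recall these two figures would imply, assuming the F-measure is the balanced F1 (the harmonic mean of precision and recall); that assumption is not stated in the abstract, and the function name `implied_recall` is hypothetical. Because the F-measure is given as "around 51.5%", the result is an approximation, not a reported value.

```python
def implied_recall(f_measure: float, precision: float) -> float:
    """Solve F = 2PR / (P + R) for R, assuming a balanced F1 measure."""
    denominator = 2 * precision - f_measure
    if denominator <= 0:
        # F1 can never exceed 2P, so such inputs are inconsistent.
        raise ValueError("F-measure and precision are inconsistent")
    return f_measure * precision / denominator


# Figures quoted in the abstract (F-measure is approximate).
recall = implied_recall(f_measure=0.515, precision=0.625)
print(f"implied recall: {recall:.1%}")

# Sanity check: recombining precision and the derived recall gives back F.
f_check = 2 * 0.625 * recall / (0.625 + recall)
print(f"recomputed F-measure: {f_check:.1%}")
```

Under that assumption the implied recall is roughly 44%, which would indicate that the tagger favours precision over coverage.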