Cell line name recognition in support of the identification of synthetic lethality in cancer from text.

Authors

Kaewphan Suwisa, Van Landeghem Sofie, Ohta Tomoko, Van de Peer Yves, Ginter Filip, Pyysalo Sampo

Affiliations

Turku Centre for Computer Science (TUCS), 20520 Turku, Finland, Department of Information Technology, University of Turku, 20014, Finland, University of Turku Graduate School (UTUGS), University of Turku, 20014, Finland.

Department of Plant Systems Biology, VIB, Ghent 9000, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium.

Publication information

Bioinformatics. 2016 Jan 15;32(2):276-82. doi: 10.1093/bioinformatics/btv570. Epub 2015 Oct 1.

Abstract

MOTIVATION

The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.

RESULTS

We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
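As an illustration of how an F-score such as those reported above can be computed for mention recognition, here is a minimal sketch. It assumes exact-match scoring over character-offset spans, which the abstract does not specify; the function name `entity_f_score` and the example spans are hypothetical and are not taken from NERsuite or from the paper's evaluation scripts.

```python
from typing import Set, Tuple

# One mention is represented as (start, end) character offsets.
# Assumption: a prediction counts as correct only if both offsets match exactly.
Span = Tuple[int, int]


def entity_f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for exact-match mention spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: three gold mentions, three predictions, two in common.
gold_spans = {(0, 4), (10, 15), (20, 26)}
predicted_spans = {(0, 4), (10, 15), (30, 33)}
precision, recall, f_score = entity_f_score(gold_spans, predicted_spans)
print(f"P={precision:.2%} R={recall:.2%} F={f_score:.2%}")
```

Other matching criteria (for example, overlapping or left/right boundary matching) would give different scores for the same predictions, so the criterion has to be fixed before results are compared across systems.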

AVAILABILITY AND IMPLEMENTATION

The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/.

CONTACT

sukaew@utu.fi.
