Kaewphan Suwisa, Van Landeghem Sofie, Ohta Tomoko, Van de Peer Yves, Ginter Filip, Pyysalo Sampo
Turku Centre for Computer Science (TUCS), 20520 Turku, Finland, Department of Information Technology, University of Turku, 20014, Finland, University of Turku Graduate School (UTUGS), University of Turku, 20014, Finland.
Department of Plant Systems Biology, VIB, Ghent 9000, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium.
Bioinformatics. 2016 Jan 15;32(2):276-82. doi: 10.1093/bioinformatics/btv570. Epub 2015 Oct 1.
The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating, for instance, the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: Gellus, a broad-coverage corpus, and CLL, a focused target-domain corpus.
We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
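As an illustration of how an F-score such as those reported above is typically computed for entity mentions, the following minimal Python sketch compares predicted spans against gold spans. It assumes an exact-match criterion on (document, start, end) offsets; the matching criterion actually used in the study is not stated in this abstract, and the function name and toy spans are hypothetical.

```python
from typing import Set, Tuple

# A mention is identified by (document id, start offset, end offset).
Span = Tuple[str, int, int]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact span matching (an assumption)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): four gold mentions, three predictions, two exact matches.
gold = {("doc1", 10, 14), ("doc1", 40, 46), ("doc2", 5, 12), ("doc2", 30, 35)}
predicted = {("doc1", 10, 14), ("doc1", 40, 46), ("doc2", 5, 11)}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Under exact matching, the prediction ("doc2", 5, 11) counts as both a false positive and a missed gold mention because its end offset differs by one character; a more lenient overlap-based criterion would score it differently.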
The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/.