Cell line name recognition in support of the identification of synthetic lethality in cancer from text.

Author information

Kaewphan Suwisa, Van Landeghem Sofie, Ohta Tomoko, Van de Peer Yves, Ginter Filip, Pyysalo Sampo

Affiliations

Turku Centre for Computer Science (TUCS), 20520 Turku, Finland; Department of Information Technology, University of Turku, 20014, Finland; University of Turku Graduate School (UTUGS), University of Turku, 20014, Finland.

Department of Plant Systems Biology, VIB, Ghent 9000, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium.

Publication information

Bioinformatics. 2016 Jan 15;32(2):276-82. doi: 10.1093/bioinformatics/btv570. Epub 2015 Oct 1.

DOI:10.1093/bioinformatics/btv570

PMID:26428294

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4708107/

Abstract

MOTIVATION

The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.

RESULTS

We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
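The two quantities reported above, the F-score and the mapping of mentions to database identifiers, can be illustrated with a minimal sketch. It is not the authors' evaluation or normalization code: it assumes exact-span matching for scoring and a simple case- and hyphen-insensitive dictionary lookup for normalization, and the function names `f_score` and `normalize`, the spans, and the `TOY:` identifiers are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Span = Tuple[int, int]  # (start offset, end offset) of a mention


def f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score), assuming exact span matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def normalize(mention: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Map a mention to an identifier, ignoring case, hyphens and spaces."""
    key = mention.lower().replace("-", "").replace(" ", "")
    return dictionary.get(key)


# Toy data: spans and identifiers are invented for illustration only.
gold = {(0, 4), (10, 16), (30, 35)}
predicted = {(0, 4), (10, 16), (40, 44)}
precision, recall, f1 = f_score(gold, predicted)

toy_dictionary = {"hela": "TOY:0001", "mcf7": "TOY:0002"}
mcf7_id = normalize("MCF-7", toy_dictionary)
```

Here two of the three predicted spans match gold spans, so precision, recall and F-score all come out equal. Under this lookup scheme, spelling variants such as "MCF-7" and "MCF7" resolve to the same dictionary entry, while names absent from the dictionary return `None`.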

AVAILABILITY AND IMPLEMENTATION

The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/.

CONTACT

sukaew@utu.fi.
