Literature classification for semi-automated updating of biological knowledgebases.

Author information

Lars Olsen, Ulrich Johan Kudahl, Ole Winther, Vladimir Brusic

Publication information

BMC Genomics. 2013;14 Suppl 5(Suppl 5):S14. doi: 10.1186/1471-2164-14-S5-S14. Epub 2013 Oct 16.

DOI:10.1186/1471-2164-14-S5-S14

PMID:24564403

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3852072/

Abstract

BACKGROUND

As the output of biological assays increases in resolution and volume, the growing body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of the higher-level knowledge needed for practical application in bioinformatics. Common types of biological data, such as sequence data, are extensively stored in biological databases. Functional annotations, such as immunological epitopes, are instead found primarily in semi-structured formats or as free text embedded in the primary scientific literature.

RESULTS

We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature.
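The classification step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the feature representation, distance measure, or value of k, so the bag-of-words counts, cosine similarity, `k=3`, the function names, and the ten toy "abstracts" are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for the real features)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validate(data, folds=5, k=3):
    """Accuracy under k-fold cross-validation; fold f holds out items f, f+folds, ..."""
    correct = 0
    for fold in range(folds):
        test = [d for i, d in enumerate(data) if i % folds == fold]
        train = [d for i, d in enumerate(data) if i % folds != fold]
        correct += sum(knn_predict(train, vec, k) == label for vec, label in test)
    return correct / len(data)


# Invented toy texts, labeled by hand for illustration only.
relevant = [
    "T-cell epitope from tumor antigen recognized by cytotoxic lymphocytes",
    "HLA restricted tumor antigen epitope identified in melanoma patients",
    "novel tumor antigen peptide epitope presented by HLA molecules",
    "cytotoxic T-cell response against tumor antigen epitope in melanoma",
    "identification of HLA binding epitope from a tumor antigen",
]
irrelevant = [
    "crop yield response to soil nitrogen in wheat fields",
    "soil moisture and wheat yield under drought conditions",
    "nitrogen fertilizer effects on crop growth in maize fields",
    "drought tolerance of maize and wheat crop varieties",
    "soil nitrogen dynamics and crop rotation in fields",
]
data = [(vectorize(t), "relevant") for t in relevant]
data += [(vectorize(t), "irrelevant") for t in irrelevant]

accuracy = cross_validate(data, folds=5, k=3)
print(f"five-fold cross-validation accuracy: {accuracy:.2f}")
```

On real data, the labeled abstracts would replace the toy lists, and the score on this toy set says nothing about the 0.95 reported above. A curator would still review every abstract predicted "relevant", which is what keeps the update process semi-automated.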

CONCLUSION

Here we propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature, using principles from text mining and machine learning. The addition of such data will aid the transition of biological databases to knowledgebases.

Figure (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d648/3852072/53b09f638f06/1471-2164-14-S5-S14-1.jpg

Similar articles

HPVdb: a data mining system for knowledge discovery in human papillomavirus with applications in T cell immunology and vaccinology.

Database (Oxford). 2014 Apr 4;2014:bau031. doi: 10.1093/database/bau031. Print 2014.

Automating document classification for the Immune Epitope Database.

BMC Bioinformatics. 2007 Jul 26;8:269. doi: 10.1186/1471-2105-8-269.

BioReader: a text mining tool for performing classification of biomedical literature.

BMC Bioinformatics. 2019 Feb 4;19(Suppl 13):57. doi: 10.1186/s12859-019-2607-x.

An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-6-S1-S17. Epub 2005 May 24.

Creating knowledgebases to text-mine PUBMED articles using clustering techniques.

AMIA Annu Symp Proc. 2003;2003:821.

Extraction of human kinase mutations from literature, databases and genotyping studies.

BMC Bioinformatics. 2009 Aug 27;10 Suppl 8(Suppl 8):S1. doi: 10.1186/1471-2105-10-S8-S1.

TagLine: Information Extraction for Semi-Structured Text in Medical Progress Notes.

AMIA Annu Symp Proc. 2014 Nov 14;2014:534-43. eCollection 2014.

Extending PubMed searches to ClinicalTrials.gov through a machine learning approach for systematic reviews.

J Clin Epidemiol. 2018 Nov;103:22-30. doi: 10.1016/j.jclinepi.2018.06.015. Epub 2018 Jul 5.

Towards pathway curation through literature mining--a case study using PharmGKB.

Pac Symp Biocomput. 2014:352-63.

Cited by

BioReader: a text mining tool for performing classification of biomedical literature.

BMC Bioinformatics. 2019 Feb 4;19(Suppl 13):57. doi: 10.1186/s12859-019-2607-x.

Precancer Atlas to Drive Precision Prevention Trials.

Cancer Res. 2017 Apr 1;77(7):1510-1541. doi: 10.1158/0008-5472.CAN-16-2346.

TANTIGEN: a comprehensive database of tumor T cell antigens.

Cancer Immunol Immunother. 2017 Jun;66(6):731-735. doi: 10.1007/s00262-017-1978-y. Epub 2017 Mar 9.

Characterization of the immunophenotypes and antigenomes of colorectal cancers reveals distinct tumor escape mechanisms and novel targets for immunotherapy.

Genome Biol. 2015 Mar 31;16(1):64. doi: 10.1186/s13059-015-0620-6.

Characterizing the human hematopoietic CDome.

Front Genet. 2014 Sep 25;5:331. doi: 10.3389/fgene.2014.00331. eCollection 2014.

Big data analytics in immunology: a knowledge-based approach.

Biomed Res Int. 2014;2014:437987. doi: 10.1155/2014/437987. Epub 2014 Jun 22.
