Enhancing navigation in biomedical databases by community voting and database-driven text classification.

Affiliation

Center for Molecular Imaging Research, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA, USA.

Publication information

BMC Bioinformatics. 2009 Oct 3;10:317. doi: 10.1186/1471-2105-10-317.

DOI:10.1186/1471-2105-10-317

PMID:19799796

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2768718/

Abstract

BACKGROUND

The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them.

RESULTS

Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.
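As a rough illustration of how a bagged ensemble yields class probability estimates, the sketch below trains single-split decision trees (stumps) on bootstrap samples of a toy set of abstracts and reports the fraction of trees voting "cancer-related" as the confidence. It is a minimal sketch under stated assumptions, not the PepBank implementation: the toy abstracts, the bag-of-words features, the use of stumps instead of full decision trees, and all function names are hypothetical.

```python
import random

def tokenize(text):
    """Represent an abstract as the set of its lowercase words."""
    return set(text.lower().split())

def train_stump(docs, labels):
    """Fit a one-split tree: pick the word whose presence best separates the classes."""
    vocab = set().union(*docs)
    best = None
    for word in sorted(vocab):
        for present_label in (0, 1):
            # Predict present_label when the word occurs, the other class otherwise.
            correct = sum(
                1 for d, y in zip(docs, labels)
                if (present_label if word in d else 1 - present_label) == y
            )
            if best is None or correct > best[0]:
                best = (correct, word, present_label)
    return best[1], best[2]

def predict_stump(stump, doc):
    word, present_label = stump
    return present_label if word in doc else 1 - present_label

def train_bagged(docs, labels, n_estimators=25, seed=0):
    """Bagging: train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(docs)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(train_stump([docs[i] for i in idx], [labels[i] for i in idx]))
    return ensemble

def predict_proba(ensemble, doc):
    """Class probability estimate: fraction of ensemble members voting for class 1."""
    return sum(predict_stump(s, doc) for s in ensemble) / len(ensemble)

# Hypothetical toy abstracts; label 1 = related to cancer, 0 = not related.
abstracts = [
    "peptide binds tumor cells in breast cancer",
    "tumor targeting peptide for cancer imaging",
    "cancer cell uptake of cyclic peptide",
    "antimicrobial peptide kills bacteria",
    "peptide hormone regulates insulin secretion",
    "bacterial membrane disruption by peptide",
]
labels = [1, 1, 1, 0, 0, 0]

docs = [tokenize(a) for a in abstracts]
ensemble = train_bagged(docs, labels)

p_cancer = predict_proba(ensemble, tokenize("peptide inhibits tumor growth in cancer"))
p_other = predict_proba(ensemble, tokenize("antimicrobial peptide targets bacteria"))
print(p_cancer, p_other)
```

A value near 0 or 1 indicates that the ensemble members agree, while a value near 0.5 indicates disagreement; a number of this kind is what a heat map cell could encode as a colour intensity.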

CONCLUSION

Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
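The idea of gathering noisy labels from a community can be illustrated with a small simulation. The sketch below flips each simulated vote with a fixed probability and aggregates votes per entry by simple majority. Both the noise model and the majority rule are assumptions made for illustration; the abstract does not specify how PepBank aggregates votes, and all names here are hypothetical.

```python
import random

def noisy_votes(true_label, n_voters, noise, rng):
    """Each simulated voter reports the true label, flipped with probability `noise`."""
    return [1 - true_label if rng.random() < noise else true_label
            for _ in range(n_voters)]

def aggregate(votes):
    """Majority vote over binary votes; a tie resolves to class 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def label_accuracy(true_labels, n_voters, noise, seed=0):
    """Fraction of entries whose aggregated label matches the true label."""
    rng = random.Random(seed)
    aggregated = [aggregate(noisy_votes(y, n_voters, noise, rng)) for y in true_labels]
    return sum(a == y for a, y in zip(aggregated, true_labels)) / len(true_labels)

# 2000 hypothetical database entries with alternating binary labels.
true_labels = [i % 2 for i in range(2000)]

for noise in (0.0, 0.1, 0.3):
    single = label_accuracy(true_labels, n_voters=1, noise=noise)
    pooled = label_accuracy(true_labels, n_voters=5, noise=noise)
    print(f"noise={noise}: 1 voter -> {single:.3f}, 5 voters -> {pooled:.3f}")
```

Under this simplified model, pooling several independent votes per entry recovers more correct labels than a single vote at the same noise level, which gives a cleaner training set for reclassification.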
