Using workflows to explore and optimise named entity recognition for chemistry.

Affiliation

National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, United Kingdom.

Publication information

PLoS One. 2011;6(5):e20181. doi: 10.1371/journal.pone.0020181. Epub 2011 May 25.

DOI:10.1371/journal.pone.0020181
PMID:21633495
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3102085/
Abstract

Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored OSCAR, an established named entity recogniser for the chemistry domain, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-and-drop mechanism of U-Compare's graphical user interface. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to slightly better performance in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, against 84.23% for OSCAR.
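The link the abstract draws between tokenisation noise, Type I/Type II errors and the F-score can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the example sentence, the gold annotations, the toy lexicon and the two tokeniser functions are hypothetical and are not taken from OSCAR or U-Compare. The recogniser is a plain lexicon lookup standing in for the MEMM and pattern-based classifiers.

```python
import re

# Hypothetical example sentence, gold annotations and toy lexicon.
TEXT = "We added 2,4-dinitrophenol to ethanol."
GOLD_NAMES = ["2,4-dinitrophenol", "ethanol"]
LEXICON = {"2,4-dinitrophenol", "dinitrophenol", "ethanol"}


def naive_tokenise(text):
    """Split on whitespace and on every punctuation character."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def chemistry_aware_tokenise(text):
    """Split on whitespace only, then trim trailing sentence punctuation,
    so commas and hyphens inside a chemical name stay in one token."""
    spans = []
    for m in re.finditer(r"\S+", text):
        start, end = m.span()
        while end > start and text[end - 1] in ".,;:":
            end -= 1
        if end > start:
            spans.append((start, end))
    return spans


def recognise(text, token_spans):
    """Toy recogniser: a token is a chemical name if it is in the lexicon."""
    return {(s, e) for s, e in token_spans if text[s:e] in LEXICON}


def score(predicted, gold):
    """Entity-level precision, recall and F-score on exact span matches."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)   # Type I errors (false positives)
    fn = len(gold - predicted)   # Type II errors (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = {(TEXT.index(n), TEXT.index(n) + len(n)) for n in GOLD_NAMES}
naive_scores = score(recognise(TEXT, naive_tokenise(TEXT)), gold)
aware_scores = score(recognise(TEXT, chemistry_aware_tokenise(TEXT)), gold)
print("naive tokeniser:", naive_scores)
print("chemistry-aware tokeniser:", aware_scores)
```

In this toy setting the naive tokeniser breaks "2,4-dinitrophenol" into five tokens, so the same recogniser tags only the fragment "dinitrophenol" (a false positive) and misses the full name (a false negative). Swapping the tokeniser while keeping the recogniser fixed mirrors, in a very simplified way, the kind of component replacement the workflows are described as supporting.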

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/aba6bd6acfe2/pone.0020181.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/121c9d257290/pone.0020181.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/4206b7350fa5/pone.0020181.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/c25ce64d02e9/pone.0020181.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/d4f55359caf2/pone.0020181.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f28/3102085/5a6aef12d5c0/pone.0020181.g006.jpg
