A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature.

Affiliation

Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Braga, Portugal.

Publication information

BMC Bioinformatics. 2011 Oct 3;12 Suppl 8(Suppl 8):S12. doi: 10.1186/1471-2105-12-S8-S12.

DOI:10.1186/1471-2105-12-S8-S12

PMID:22151823

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3269935/

Abstract

BACKGROUND

We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we extensively tested available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In short, we combined classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we used for the IMT into the ACT pipeline.
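As a rough illustration of the IMT idea described above (pattern matching for candidate methods within sentences, followed by a statistical ranking), the following minimal sketch may help. It is not the authors' implementation: the pattern dictionary, the naive sentence splitter, and the scoring rule (sentence hits multiplied by document-level hits) are all assumptions chosen for brevity, and the PSI-MI identifiers are shown only as example labels.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: PSI-MI style method labels mapped to regex patterns.
# A real system would derive its patterns from the full PSI-MI ontology.
METHOD_PATTERNS = {
    "MI:0018": r"(?:yeast\s+)?two[-\s]hybrid",
    "MI:0019": r"co-?immunoprecipitat\w*",
    "MI:0096": r"pull[-\s]?down",
}


def split_sentences(text):
    """Naive sentence splitter (assumption): break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def rank_evidence(text):
    """Return (score, method_id, sentence) tuples, highest score first.

    Illustrative score: hits in the sentence multiplied by the number of
    hits for the same method in the whole document.
    """
    compiled = {mid: re.compile(p, re.IGNORECASE) for mid, p in METHOD_PATTERNS.items()}
    sentences = split_sentences(text)

    # First pass: count matches per sentence and per document.
    doc_hits = Counter()
    per_sentence = []
    for sentence in sentences:
        for mid, pattern in compiled.items():
            hits = sum(1 for _ in pattern.finditer(sentence))
            if hits:
                doc_hits[mid] += hits
                per_sentence.append((hits, mid, sentence))

    # Second pass: score each candidate and sort (stable, descending by score).
    candidates = [(hits * doc_hits[mid], mid, s) for hits, mid, s in per_sentence]
    return sorted(candidates, key=lambda c: -c[0])
```

Calling `rank_evidence(full_text)` yields candidate evidence sentences, each paired with a method label, which a curator could then review from the top of the list.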

RESULTS

For the ACT, our linear article classifier achieves a ranking and classification performance significantly higher than that of all reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Matthews Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are: ABNER, NLPROT, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods. A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods.
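For readers unfamiliar with two of the classification measures named above, the sketch below computes the F-Score and the Matthews Correlation Coefficient from confusion-matrix counts using their standard textbook definitions. The function names and the zero-denominator conventions are assumptions made for this illustration; the official challenge evaluation scripts may handle edge cases differently.

```python
import math


def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike the F-Score, the Matthews Correlation Coefficient also uses the true-negative count, which makes it informative when relevant and irrelevant articles are imbalanced.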

CONCLUSIONS

For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is very competitive in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as "rules" for human understanding of the classification. We also provide evidence on which named entity recognition tools benefit protein-interaction article classification and which do not. For the IMT, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not evaluate the evidence provided by the systems, we conducted a separate assessment, in which multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline.
